Computer Vision and Pattern Recognition 96
☆ FaceXBench: Evaluating Multimodal LLMs on Face Understanding
Multimodal Large Language Models (MLLMs) demonstrate impressive
problem-solving abilities across a wide range of tasks and domains. However,
their capacity for face understanding has not been systematically studied. To
address this gap, we introduce FaceXBench, a comprehensive benchmark designed
to evaluate MLLMs on complex face understanding tasks. FaceXBench includes
5,000 multimodal multiple-choice questions derived from 25 public datasets and
a newly created dataset, FaceXAPI. These questions cover 14 tasks across 6
broad categories, assessing MLLMs' face understanding abilities in bias and
fairness, face authentication, recognition, analysis, localization and tool
retrieval. Using FaceXBench, we conduct an extensive evaluation of 26
open-source MLLMs alongside 2 proprietary models, revealing the unique
challenges in complex face understanding tasks. We analyze the models across
three evaluation settings: zero-shot, in-context task description, and
chain-of-thought prompting. Our detailed analysis reveals that current MLLMs,
including advanced models like GPT-4o and GeminiPro 1.5, show significant room
for improvement. We believe FaceXBench will be a crucial resource for
developing MLLMs equipped to perform sophisticated face understanding. Code:
https://github.com/Kartik-3004/facexbench
comment: Project Page: https://kartik-3004.github.io/facexbench/
☆ Zero-Shot Monocular Scene Flow Estimation in the Wild
Large models have shown generalization across datasets for many low-level
vision tasks, like depth estimation, but no such general models exist for scene
flow. Even though scene flow has wide potential use, it is not used in practice
because current predictive models do not generalize well. We identify three key
challenges and propose solutions for each. First, we create a method that
jointly estimates geometry and motion for accurate prediction. Second, we
alleviate scene flow data scarcity with a data recipe that affords us 1M
annotated training samples across diverse synthetic scenes. Third, we evaluate
different parameterizations for scene flow prediction and adopt a natural and
effective parameterization. Our resulting model outperforms existing methods as
well as baselines built on large-scale models in terms of 3D end-point error,
and shows zero-shot generalization to the casually captured videos from DAVIS
and the robotic manipulation scenes from RoboTAP. Overall, our approach makes
scene flow prediction more practical in-the-wild.
comment: Project Website: https://research.nvidia.com/labs/zero_msf
☆ 3rd Workshop on Maritime Computer Vision (MaCVi) 2025: Challenge Results
Benjamin Kiefer, Lojze Žust, Jon Muhovič, Matej Kristan, Janez Perš, Matija Teršek, Uma Mudenagudi, Chaitra Desai, Arnold Wiliem, Marten Kreis, Nikhil Akalwadi, Yitong Quan, Zhiqiang Zhong, Zhe Zhang, Sujie Liu, Xuran Chen, Yang Yang, Matej Fabijanić, Fausto Ferreira, Seongju Lee, Junseok Lee, Kyoobin Lee, Shanliang Yao, Runwei Guan, Xiaoyu Huang, Yi Ni, Himanshu Kumar, Yuan Feng, Yi-Ching Cheng, Tzu-Yu Lin, Chia-Ming Lee, Chih-Chung Hsu, Jannik Sheikh, Andreas Michel, Wolfgang Gross, Martin Weinmann, Josip Šarić, Yipeng Lin, Xiang Yang, Nan Jiang, Yutang Lu, Fei Feng, Ali Awad, Evan Lucas, Ashraf Saleem, Ching-Heng Cheng, Yu-Fan Lin, Tzu-Yu Lin, Chih-Chung Hsu
The 3rd Workshop on Maritime Computer Vision (MaCVi) 2025 addresses maritime
computer vision for Unmanned Surface Vehicles (USV) and underwater settings. This report
offers a comprehensive overview of the findings from the challenges. We provide
both statistical and qualitative analyses, evaluating trends from over 700
submissions. All datasets, evaluation code, and the leaderboard are available
to the public at https://macvi.org/workshop/macvi25.
comment: Part of the MaCVi 2025 workshop
☆ DiffStereo: High-Frequency Aware Diffusion Model for Stereo Image Restoration
Diffusion models (DMs) have achieved promising performance in image
restoration but have not yet been explored for stereo images. The application of DMs
in stereo image restoration is confronted with a series of challenges. The need
to reconstruct two images exacerbates DM's computational cost. Additionally,
existing latent DMs usually focus on semantic information and remove
high-frequency details as redundancy during latent compression, which is
precisely what matters for image restoration. To address the above problems, we
propose a high-frequency aware diffusion model, DiffStereo, for stereo image
restoration, as the first attempt to apply DMs in this domain. Specifically, DiffStereo
first learns latent high-frequency representations (LHFR) of high-quality (HQ) images. A DM is
then trained in the learned space to estimate LHFR for stereo images, which are
fused into a transformer-based stereo image restoration network providing
beneficial high-frequency information of corresponding HQ images. The
resolution of the LHFR is kept the same as that of the input images, which preserves the
inherent texture from distortion, while channel-wise compression alleviates
the computational burden of the DM. Furthermore, we devise a position encoding
scheme when integrating the LHFR into the restoration network, enabling
distinctive guidance in different depths of the restoration network.
Comprehensive experiments verify that by combining generative DM and
transformer, DiffStereo achieves both higher reconstruction accuracy and better
perceptual quality on stereo super-resolution, deblurring, and low-light
enhancement compared with state-of-the-art methods.
comment: 9 pages, 6 figures
☆ New Fashion Products Performance Forecasting: A Survey on Evolutions, Models and Emerging Trends
The fast fashion industry's insatiable demand for new styles and rapid
production cycles has led to a significant environmental burden.
Overproduction, excessive waste, and harmful chemicals have contributed to the
negative environmental impact of the industry. To mitigate these issues, a
paradigm shift that prioritizes sustainability and efficiency is urgently
needed. Integrating learning-based predictive analytics into the fashion
industry represents a significant opportunity to address environmental
challenges and drive sustainable practices. By forecasting fashion trends and
optimizing production, brands can reduce their ecological footprint while
remaining competitive in a rapidly changing market. However, one of the key
challenges in forecasting fashion sales is the dynamic nature of consumer
preferences. Fashion is acyclical, with trends constantly evolving and
resurfacing. In addition, cultural changes and unexpected events can disrupt
established patterns. This problem is also known as New Fashion Products
Performance Forecasting (NFPPF), and it has recently gained more and more
interest in the global research landscape. Given its multidisciplinary nature,
the field of NFPPF has been approached from many different angles. This
comprehensive survey wishes to provide an up-to-date overview that focuses on
learning-based NFPPF strategies. The survey is based on the Preferred Reporting
Items for Systematic Reviews and Meta-Analyses (PRISMA) methodological flow,
allowing for a systematic and complete literature review. In particular, we
propose the first taxonomy that covers the learning panorama for NFPPF,
examining in detail the different methodologies used to increase the amount of
multimodal information, as well as the state-of-the-art available datasets.
Finally, we discuss the challenges and future directions.
comment: Accepted at the Springer Nature Computer Science journal
☆ HiMix: Reducing Computational Complexity in Large Vision-Language Models
Xuange Zhang, Dengjie Li, Bo Liu, Zenghao Bao, Yao Zhou, Baisong Yang, Zhongying Liu, Yujie Zhong, Zheng Zhao, Tongtong Yuan
Benefiting from recent advancements in large language models and modality
alignment techniques, existing Large Vision-Language Models (LVLMs) have
achieved prominent performance across a wide range of scenarios. However, the
excessive computational complexity limits the widespread use of these models in
practical applications. We argue that one main bottleneck in computational
complexity is caused by the involvement of redundant vision sequences in model
computation. This argument is motivated by a reassessment of how efficiently vision and
language information is transmitted in the language decoder of LVLMs. We therefore
propose a novel hierarchical vision-language interaction mechanism called
Hierarchical Vision injection for Mixture Attention (HiMix). In HiMix, only the
language sequence undergoes full forward propagation, while the vision sequence
interacts with the language at specific stages within each language decoder
layer. It is striking that our approach significantly reduces computational
complexity with minimal performance loss. Specifically, HiMix achieves a 10x
reduction in the computational cost of the language decoder across multiple
LVLM models while maintaining comparable performance. This highlights the
advantages of our method, and we hope our research brings new perspectives to
the field of vision-language understanding. Project Page:
https://xuange923.github.io/HiMix
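The abstract sketches the core mechanism at a level of detail worth making concrete: only the language sequence is fully propagated, while vision tokens are injected via attention at specific points of each decoder layer. The snippet below is a hypothetical, simplified illustration of such a vision-injection step, not the authors' implementation; the class name, layer sizes, and placement of the injection are assumptions.

```python
import torch
import torch.nn as nn

class VisionInjectedDecoderLayer(nn.Module):
    """Hypothetical sketch: language tokens self-attend and are fully
    propagated, while vision tokens are injected only as keys/values in a
    single cross-attention step, loosely following the HiMix abstract."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, lang: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        # Full forward propagation only for the language sequence.
        q = self.norm1(lang)
        lang = lang + self.self_attn(q, q, q, need_weights=False)[0]
        # Vision tokens act only as keys/values at this injection point,
        # so they never pass through the feed-forward path themselves.
        lang = lang + self.cross_attn(self.norm2(lang), vision, vision,
                                      need_weights=False)[0]
        return lang + self.ffn(self.norm3(lang))

# Toy usage: 32 language tokens querying 256 vision tokens.
layer = VisionInjectedDecoderLayer()
out = layer(torch.randn(1, 32, 512), torch.randn(1, 256, 512))
print(out.shape)  # torch.Size([1, 32, 512])
```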
☆ GSTAR: Gaussian Surface Tracking and Reconstruction
3D Gaussian Splatting techniques have enabled efficient photo-realistic
rendering of static scenes. Recent works have extended these approaches to
support surface reconstruction and tracking. However, tracking dynamic surfaces
with 3D Gaussians remains challenging due to complex topology changes, such as
surfaces appearing, disappearing, or splitting. To address these challenges, we
propose GSTAR, a novel method that achieves photo-realistic rendering, accurate
surface reconstruction, and reliable 3D tracking for general dynamic scenes
with changing topology. Given multi-view captures as input, GSTAR binds
Gaussians to mesh faces to represent dynamic objects. For surfaces with
consistent topology, GSTAR maintains the mesh topology and tracks the meshes
using Gaussians. In regions where topology changes, GSTAR adaptively unbinds
Gaussians from the mesh, enabling accurate registration and the generation of
new surfaces based on these optimized Gaussians. Additionally, we introduce a
surface-based scene flow method that provides robust initialization for
tracking between frames. Experiments demonstrate that our method effectively
tracks and reconstructs dynamic surfaces, enabling a range of applications. Our
project page with the code release is available at
https://chengwei-zheng.github.io/GSTAR/.
☆ MutualForce: Mutual-Aware Enhancement for 4D Radar-LiDAR 3D Object Detection ICASSP 2025
Radar and LiDAR have been widely used in autonomous driving as LiDAR provides
rich structure information, and radar demonstrates high robustness under
adverse weather. Recent studies highlight the effectiveness of fusing radar and
LiDAR point clouds. However, challenges remain due to the modality misalignment
and information loss during feature extractions. To address these issues, we
propose a 4D radar-LiDAR framework to mutually enhance their representations.
Initially, the indicative features from radar are utilized to guide both radar
and LiDAR geometric feature learning. Subsequently, to mitigate their sparsity
gap, the shape information from LiDAR is used to enrich radar BEV features.
Extensive experiments on the View-of-Delft (VoD) dataset demonstrate our
approach's superiority over existing methods, achieving the highest mAP of
71.76% across the entire area and 86.36% within the driving corridor.
Especially for cars, we improve the AP by 4.17% and 4.20% due to the strong
indicative features and symmetric shapes.
comment: Accepted by ICASSP 2025
☆ Robust Egoistic Rigid Body Localization
We consider a robust and self-reliant (or "egoistic") variation of the rigid
body localization (RBL) problem, in which a primary rigid body seeks to
estimate the pose (i.e., location and orientation) of another rigid body (or
"target"), relative to its own, without the assistance of external
infrastructure, without prior knowledge of the shape of the target, and taking
into account the possibility that the available observations are incomplete.
Three complementary contributions are then offered for such a scenario. The
first is a method to estimate the translation vector between the center points
of the two rigid bodies, which unlike existing techniques does not require that
both objects have the same shape or even the same number of landmark points.
This technique is shown to significantly outperform the state-of-the-art (SotA)
under complete information, but to be sensitive to data erasures, even when
enhanced by matrix completion methods. The second contribution, designed to
improve performance in the presence of incomplete information, offers a
robust alternative to the first method, at the expense of a slight relative loss
under complete information. Finally, the third contribution is a scheme for the
estimation of the rotation matrix describing the relative orientation of the
target rigid body with respect to the primary. Comparisons of the proposed
schemes and SotA techniques demonstrate the advantage of the contributed
methods in terms of root mean square error (RMSE) performance under both
complete and incomplete information conditions.
☆ Disharmony: Forensics using Reverse Lighting Harmonization
Content generation and manipulation approaches based on deep learning methods
have seen significant advancements, leading to an increased need for techniques
to detect whether an image has been generated or edited. Another area of
research focuses on the insertion and harmonization of objects within images.
In this study, we explore the potential of using harmonization data in
conjunction with a segmentation model to enhance the detection of edited image
regions. These edits can be either manually crafted or generated using deep
learning methods. Our findings demonstrate that this approach can effectively
identify such edits. Existing forensic models often overlook the detection of
harmonized objects in relation to the background, but our proposed Disharmony
Network addresses this gap. By utilizing an aggregated dataset of harmonization
techniques, our model outperforms existing forensic networks in identifying
harmonized objects integrated into their backgrounds, and shows potential for
detecting various forms of edits, including virtual try-on tasks.
☆ Hypercone Assisted Contour Generation for Out-of-Distribution Detection
Recent advances in the field of out-of-distribution (OOD) detection have
placed great emphasis on learning better representations suited to this task.
While there are distance-based approaches, distributional awareness has seldom
been exploited for better performance. We present HAC$_k$-OOD, a novel OOD
detection method that makes no distributional assumption about the data, but
automatically adapts to its distribution. Specifically, HAC$_k$-OOD constructs
a set of hypercones by maximizing the angular distance to neighbors in a given
data-point's vicinity to approximate the contour within which in-distribution
(ID) data-points lie. Experimental results show state-of-the-art FPR@95 and
AUROC performance on Near-OOD detection and on Far-OOD detection on the
challenging CIFAR-100 benchmark without explicitly training for OOD
performance.
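The abstract describes scoring points by angular distance within a neighborhood of in-distribution data. As a purely illustrative companion, the toy below sketches a generic angular-distance OOD score against nearest ID neighbors; it is an assumption-laden simplification, not the HAC$_k$-OOD hypercone construction itself.

```python
import numpy as np

def angular_ood_score(x, id_feats, k=10):
    """Toy angular OOD score (an assumption, not the HAC_k-OOD algorithm):
    a test feature is scored by its mean angle to the k nearest
    in-distribution features; larger angles suggest the point falls
    outside the contour spanned by ID data."""
    x = x / np.linalg.norm(x)
    id_feats = id_feats / np.linalg.norm(id_feats, axis=1, keepdims=True)
    cos = id_feats @ x                            # cosine similarity to all ID points
    angles = np.arccos(np.clip(cos, -1.0, 1.0))
    return np.sort(angles)[:k].mean()             # mean angle to k nearest neighbors

rng = np.random.default_rng(0)
id_feats = rng.normal(size=(1000, 64)) + 3.0       # synthetic ID cluster
print(angular_ood_score(id_feats[0] + 0.01, id_feats))  # small angle: looks ID
print(angular_ood_score(-id_feats[0], id_feats))         # large angle: looks OOD
```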
☆ Adaptive Clustering for Efficient Phenotype Segmentation of UAV Hyperspectral Data WACV 2025
Unmanned Aerial Vehicles (UAVs) combined with Hyperspectral imaging (HSI)
offer potential for environmental and agricultural applications by capturing
detailed spectral information that enables the prediction of invisible features
like biochemical leaf properties. However, the data-intensive nature of HSI
poses challenges for remote devices, which have limited computational resources
and storage. This paper introduces an Online Hyperspectral Simple Linear
Iterative Clustering algorithm (OHSLIC) framework for real-time tree phenotype
segmentation. OHSLIC reduces inherent noise and computational demands through
adaptive incremental clustering and a lightweight neural network, which
phenotypes trees using leaf contents such as chlorophyll, carotenoids, and
anthocyanins. A hyperspectral dataset is created using a custom simulator that
incorporates realistic leaf parameters, and light interactions. Results
demonstrate that OHSLIC achieves superior regression accuracy and segmentation
performance compared to pixel- or window-based methods while significantly
reducing inference time. The method's adaptive clustering enables dynamic
trade-offs between computational efficiency and accuracy, paving the way for
scalable edge-device deployment in HSI applications.
comment: accepted WACV 2025 GeoCV workshop
☆ CSHNet: A Novel Information Asymmetric Image Translation Method
Despite advancements in cross-domain image translation, challenges persist in
asymmetric tasks such as SAR-to-Optical and Sketch-to-Instance conversions,
which involve transforming data from a less detailed domain into one with
richer content. Traditional CNN-based methods are effective at capturing fine
details but struggle with global structure, leading to unwanted merging of
image regions. To address this, we propose the CNN-Swin Hybrid Network
(CSHNet), which combines two key modules: Swin Embedded CNN (SEC) and CNN
Embedded Swin (CES), forming the SEC-CES-Bottleneck (SCB). SEC leverages CNN's
detailed feature extraction while integrating the Swin Transformer's structural
bias. CES, in turn, preserves the Swin Transformer's global integrity,
compensating for CNN's lack of focus on structure. Additionally, CSHNet
includes two components designed to enhance cross-domain information retention:
the Interactive Guided Connection (IGC), which enables dynamic information
exchange between SEC and CES, and Adaptive Edge Perception Loss (AEPL), which
maintains structural boundaries during translation. Experimental results show
that CSHNet outperforms existing methods in both visual quality and performance
metrics across scene-level and instance-level datasets. Our code is available
at: https://github.com/XduShi/CSHNet.
☆ Structure-guided Deep Multi-View Clustering
Deep multi-view clustering seeks to utilize the abundant information from
multiple views to improve clustering performance. However, most existing
clustering methods neglect to fully mine multi-view structural
information and fail to explore the distribution of multi-view data, limiting
clustering performance. To address these limitations, we propose a
structure-guided deep multi-view clustering model. Specifically, we introduce a
positive sample selection strategy based on neighborhood relationships, coupled
with a corresponding loss function. This strategy constructs multi-view nearest
neighbor graphs to dynamically redefine positive sample pairs, enabling the
mining of local structural information within multi-view data and enhancing the
reliability of positive sample selection. Additionally, we introduce a Gaussian
distribution model to uncover latent structural information, together with a
loss function that reduces discrepancies between view embeddings. These two
strategies explore multi-view structural information and data distribution from
different perspectives, enhancing consistency across views and increasing
intra-cluster compactness. Experimental evaluations demonstrate the efficacy of
our method, showing significant improvements in clustering performance on
multiple benchmark datasets compared to state-of-the-art multi-view clustering
approaches.
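The positive sample selection described above rests on neighborhood relationships across views. The fragment below is an illustrative sketch of one such strategy (mutual k-nearest-neighbor agreement across two views); the exact graph construction and loss in the paper may differ, so treat the function and its parameters as assumptions.

```python
import numpy as np

def mutual_knn_positives(view_a, view_b, k=5):
    """Illustrative sketch (not the paper's exact strategy): treat a pair
    (i, j) as positive when j is among i's k nearest neighbors in view A
    and i is among j's k nearest neighbors in view B."""
    def knn(feats):
        d = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)               # exclude self-matches
        return np.argsort(d, axis=1)[:, :k]

    nn_a, nn_b = knn(view_a), knn(view_b)
    positives = []
    for i in range(len(view_a)):
        for j in nn_a[i]:
            if i in nn_b[j]:                      # neighborhood holds in both views
                positives.append((i, j))
    return positives

rng = np.random.default_rng(0)
va, vb = rng.normal(size=(100, 16)), rng.normal(size=(100, 16))
print(len(mutual_knn_positives(va, vb)))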
☆ A Vision-Language Framework for Multispectral Scene Representation Using Language-Grounded Features
Scene understanding in remote sensing often faces challenges in generating
accurate representations for complex environments such as various land use
areas or coastal regions, which may also include snow, clouds, or haze. To
address this, we present a vision-language framework named Spectral LLaVA,
which integrates multispectral data with vision-language alignment techniques
to enhance scene representation and description. Using the BigEarthNet v2
dataset from Sentinel-2, we establish a baseline with RGB-based scene
descriptions and further demonstrate substantial improvements through the
incorporation of multispectral information. Our framework optimizes a
lightweight linear projection layer for alignment while keeping the vision
backbone of SpectralGPT frozen. Our experiments encompass scene classification
using linear probing and language modeling for jointly performing scene
classification and description generation. Our results highlight Spectral
LLaVA's ability to produce detailed and accurate descriptions, particularly for
scenarios where RGB data alone proves inadequate, while also enhancing
classification performance by refining SpectralGPT features into semantically
meaningful representations.
☆ ACE: Anatomically Consistent Embeddings in Composition and Decomposition WACV 2025
Ziyu Zhou, Haozhe Luo, Mohammad Reza Hosseinzadeh Taher, Jiaxuan Pang, Xiaowei Ding, Michael Gotway, Jianming Liang
Medical images acquired from standardized protocols show consistent
macroscopic or microscopic anatomical structures, and these structures consist
of composable/decomposable organs and tissues, but existing self-supervised
learning (SSL) methods do not appreciate such composable/decomposable structure
attributes inherent to medical images. To overcome this limitation, this paper
introduces a novel SSL approach called ACE to learn anatomically consistent
embedding via composition and decomposition with two key branches: (1) global
consistency, capturing discriminative macro-structures via extracting global
features; (2) local consistency, learning fine-grained anatomical details from
composable/decomposable patch features via corresponding matrix matching.
Experimental results across 6 datasets and 2 backbones, evaluated in few-shot
learning, fine-tuning, and property analysis, show ACE's superior robustness,
transferability, and clinical potential. The innovations of our ACE lie in
grid-wise image cropping, leveraging the intrinsic properties of
compositionality and decompositionality of medical images, bridging the
semantic gap from high-level pathologies to low-level tissue anomalies, and
providing a new SSL method for medical imaging.
comment: Accepted by WACV 2025
☆ Spatio-temporal Graph Learning on Adaptive Mined Key Frames for High-performance Multi-Object Tracking
In the realm of multi-object tracking, the challenge of accurately capturing
the spatial and temporal relationships between objects in video sequences
remains a significant hurdle. This is further complicated by frequent
occurrences of mutual occlusions among objects, which can lead to tracking
errors and reduced performance in existing methods. Motivated by these
challenges, we propose a novel adaptive key frame mining strategy that
addresses the limitations of current tracking approaches. Specifically, we
introduce a Key Frame Extraction (KFE) module that leverages reinforcement
learning to adaptively segment videos, thereby guiding the tracker to exploit
the intrinsic logic of the video content. This approach allows us to capture
structured spatial relationships between different objects as well as the
temporal relationships of objects across frames. To tackle the issue of object
occlusions, we have developed an Intra-Frame Feature Fusion (IFF) module.
Unlike traditional graph-based methods that primarily focus on inter-frame
feature fusion, our IFF module uses a Graph Convolutional Network (GCN) to
facilitate information exchange between the target and surrounding objects
within a frame. This innovation significantly enhances target
distinguishability and mitigates tracking loss and appearance similarity due to
occlusions. By combining the strengths of both long and short trajectories and
considering the spatial relationships between objects, our proposed tracker
achieves impressive results on the MOT17 dataset, i.e., 68.6 HOTA, 81.0 IDF1,
66.6 AssA, and 893 IDS, proving its effectiveness and accuracy.
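The IFF module above is described as a GCN exchanging information between a target and surrounding objects within a frame. The snippet below is a generic, minimal GCN propagation step meant only to illustrate that idea; the normalization, feature sizes, and adjacency definition are assumptions, not the paper's module.

```python
import torch

def gcn_intra_frame_fusion(node_feats: torch.Tensor,
                           adj: torch.Tensor,
                           weight: torch.Tensor) -> torch.Tensor:
    """Minimal GCN step (a generic sketch, not the paper's IFF module):
    each detected object exchanges information with the objects it is
    connected to in the same frame via a normalized adjacency matrix."""
    adj_hat = adj + torch.eye(adj.size(0))             # add self-loops
    deg = adj_hat.sum(dim=1)
    d_inv_sqrt = torch.diag(deg.pow(-0.5))
    norm_adj = d_inv_sqrt @ adj_hat @ d_inv_sqrt       # D^-1/2 (A+I) D^-1/2
    return torch.relu(norm_adj @ node_feats @ weight)

# Toy frame with 4 detected objects and 32-d appearance features.
feats = torch.randn(4, 32)
adj = torch.tensor([[0, 1, 1, 0],
                    [1, 0, 1, 0],
                    [1, 1, 0, 1],
                    [0, 0, 1, 0]], dtype=torch.float32)
fused = gcn_intra_frame_fusion(feats, adj, torch.randn(32, 32))
print(fused.shape)  # torch.Size([4, 32])
```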
☆ FECT: Classification of Breast Cancer Pathological Images Based on Fusion Features
Breast cancer is one of the most common cancers among women globally, with
early diagnosis and precise classification being crucial. With the advancement
of deep learning and computer vision, the automatic classification of breast
tissue pathological images has emerged as a research focus. Existing methods
typically rely on singular cell or tissue features and lack design
considerations for morphological characteristics of challenging-to-classify
categories, resulting in suboptimal classification performance. To address
these problems, we propose a novel breast cancer tissue classification model
with Fused features of Edges, Cells, and Tissues (FECT), employing the
ResMTUNet and an attention-based aggregator to extract and aggregate these
features. Extensive testing on the BRACS dataset demonstrates that our model
surpasses current advanced methods in terms of classification accuracy and F1
scores. Moreover, due to its feature fusion that aligns with the diagnostic
approach of pathologists, our model exhibits interpretability and holds promise
for significant roles in future clinical applications.
☆ DiffVSR: Enhancing Real-World Video Super-Resolution with Diffusion Models for Advanced Visual Quality and Temporal Consistency
Xiaohui Li, Yihao Liu, Shuo Cao, Ziyan Chen, Shaobin Zhuang, Xiangyu Chen, Yinan He, Yi Wang, Yu Qiao
Diffusion models have demonstrated exceptional capabilities in image
generation and restoration, yet their application to video super-resolution
faces significant challenges in maintaining both high fidelity and temporal
consistency. We present DiffVSR, a diffusion-based framework for real-world
video super-resolution that effectively addresses these challenges through key
innovations. For intra-sequence coherence, we develop a multi-scale temporal
attention module and temporal-enhanced VAE decoder that capture fine-grained
motion details. To ensure inter-sequence stability, we introduce a noise
rescheduling mechanism with an interweaved latent transition approach, which
enhances temporal consistency without additional training overhead. We propose
a progressive learning strategy that transitions from simple to complex
degradations, enabling robust optimization despite limited high-quality video
data. Extensive experiments demonstrate that DiffVSR delivers superior results
in both visual quality and temporal consistency, setting a new performance
standard in real-world video super-resolution.
comment: Project page: https://xh9998.github.io/DiffVSR-project/
☆ Universal Actions for Enhanced Embodied Foundation Models
Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, Xianyuan Zhan
Training on diverse, internet-scale data is a key factor in the success of
recent large foundation models. Yet, using the same recipe for building
embodied agents has faced noticeable difficulties. Despite the availability of
many crowd-sourced embodied datasets, their action spaces often exhibit
significant heterogeneity due to distinct physical embodiment and control
interfaces for different robots, causing substantial challenges in developing
embodied foundation models using cross-domain data. In this paper, we introduce
UniAct, a new embodied foundation modeling framework operating in a tokenized
Universal Action Space. Our learned universal actions capture the generic
atomic behaviors across diverse robots by exploiting their shared structural
features, and enable enhanced cross-domain data utilization and
cross-embodiment generalizations by eliminating the notorious heterogeneity.
The universal actions can be efficiently translated back to heterogeneous
actionable commands by simply adding embodiment-specific details, from which
fast adaptation to new robots becomes simple and straightforward. Our 0.5B
instantiation of UniAct outperforms 14X larger SOTA embodied foundation models
in extensive evaluations on various real-world and simulation robots,
showcasing exceptional cross-embodiment control and adaptation capability,
highlighting the crucial benefit of adopting universal actions. Project page:
https://github.com/2toinf/UniAct
comment: Preprint
☆ landmarker: a Toolkit for Anatomical Landmark Localization in 2D/3D Images
Anatomical landmark localization in 2D/3D images is a critical task in
medical imaging. Although many general-purpose tools exist for landmark
localization in classical computer vision tasks, such as pose estimation, they
lack the specialized features and modularity necessary for anatomical landmark
localization applications in the medical domain. Therefore, we introduce
landmarker, a Python package built on PyTorch. The package provides a
comprehensive, flexible toolkit for developing and evaluating landmark
localization algorithms, supporting a range of methodologies, including static
and adaptive heatmap regression. landmarker enhances the accuracy of landmark
identification, streamlines research and development processes, and supports
various image formats and preprocessing pipelines. Its modular design allows
users to customize and extend the toolkit for specific datasets and
applications, accelerating innovation in medical imaging. landmarker addresses
a critical need for precision and customization in landmark localization tasks
not adequately met by existing general-purpose pose estimation tools.
comment: 11 pages, 4 figures
☆ Classifier Ensemble for Efficient Uncertainty Calibration of Deep Neural Networks for Image Classification
This paper investigates novel classifier ensemble techniques for uncertainty
calibration applied to various deep neural networks for image classification.
We evaluate both accuracy and calibration metrics, focusing on Expected
Calibration Error (ECE) and Maximum Calibration Error (MCE). Our work compares
different methods for building simple yet efficient classifier ensembles,
including majority voting and several metamodel-based approaches. Our
evaluation reveals that while state-of-the-art deep neural networks for image
classification achieve high accuracy on standard datasets, they frequently
suffer from significant calibration errors. Basic ensemble techniques like
majority voting provide modest improvements, while metamodel-based ensembles
consistently reduce ECE and MCE across all architectures. Notably, the largest
of our compared metamodels demonstrate the most substantial calibration
improvements, with minimal impact on accuracy. Moreover, classifier ensembles
with metamodels outperform traditional model ensembles in calibration
performance, while requiring significantly fewer parameters. In comparison to
traditional post-hoc calibration methods, our approach removes the need for a
separate calibration dataset. These findings underscore the potential of our
proposed metamodel-based classifier ensembles as an efficient and effective
approach to improving model calibration, thereby contributing to more reliable
deep learning systems.
comment: This paper has been accepted at International Conference on Computer
Vision Theory and Applications (VISAPP), 2025
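The calibration metrics used in the abstract above, Expected Calibration Error (ECE) and Maximum Calibration Error (MCE), follow a standard binning definition. The sketch below shows that standard computation from per-sample confidences and correctness; it is not tied to the paper's ensembles, and bin count and inputs are assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Standard ECE/MCE: bin predictions by confidence and measure the gap
    between accuracy and mean confidence per bin (ECE weights bins by size,
    MCE takes the worst bin)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap        # weighted by fraction of samples in bin
            mce = max(mce, gap)             # worst-case bin gap
    return ece, mce

conf = np.array([0.95, 0.9, 0.8, 0.6, 0.55])
hit = np.array([1, 1, 0, 1, 0])
print(expected_calibration_error(conf, hit))
```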
☆ Leveraging Confident Image Regions for Source-Free Domain-Adaptive Object Detection
Source-free domain-adaptive object detection is an interesting but scarcely
addressed topic. It aims at adapting a source-pretrained detector to a distinct
target domain without resorting to source data during adaptation. So far, there
is no data augmentation scheme tailored to source-free domain-adaptive object
detection. To this end, this paper presents a novel data augmentation approach
that cuts out target image regions where the detector is confident, augments
them along with their respective pseudo-labels, and joins them into a
challenging target image to adapt the detector. As the source data is out of
reach during adaptation, we implement our approach within a teacher-student
learning paradigm to ensure that the model does not collapse during the
adaptation procedure. We evaluated our approach on three adaptation benchmarks
of traffic scenes, scoring new state-of-the-art on two of them.
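The augmentation idea above (cut out confident regions, keep their pseudo-labels, and compose them into a challenging target image) lends itself to a small illustration. The following is an assumed, simplified sketch of such a cut-and-paste step, not the paper's pipeline; thresholds, box format, and the same-size-image assumption are all hypothetical.

```python
import numpy as np

def paste_confident_regions(src_img, src_boxes, src_scores, dst_img,
                            conf_thresh=0.8):
    """Illustrative sketch (assumed details, not the paper's method):
    copy image regions where the detector is confident, paste them into
    another target image, and carry the boxes along as pseudo-labels."""
    out = dst_img.copy()
    pseudo_labels = []
    for box, score in zip(src_boxes, src_scores):
        if score < conf_thresh:
            continue                                 # skip low-confidence detections
        x1, y1, x2, y2 = box
        out[y1:y2, x1:x2] = src_img[y1:y2, x1:x2]    # paste confident crop
        pseudo_labels.append(box)                    # keep its pseudo-label
    return out, pseudo_labels

rng = np.random.default_rng(0)
src = rng.integers(0, 255, (256, 256, 3), dtype=np.uint8)
dst = rng.integers(0, 255, (256, 256, 3), dtype=np.uint8)
aug, labels = paste_confident_regions(src, [(10, 10, 80, 90)], [0.93], dst)
print(aug.shape, labels)
```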
☆ Few-shot Structure-Informed Machinery Part Segmentation with Foundation Models and Graph Neural Networks WACV
This paper proposes a novel approach to few-shot semantic segmentation for
machinery with multiple parts that exhibit spatial and hierarchical
relationships. Our method integrates the foundation models CLIPSeg and Segment
Anything Model (SAM) with the interest point detector SuperPoint and a graph
convolutional network (GCN) to accurately segment machinery parts. By providing
1 to 25 annotated samples, our model, evaluated on a purely synthetic dataset
depicting a truck-mounted loading crane, achieves effective segmentation across
various levels of detail. Training times are kept under five minutes on
consumer GPUs. The model demonstrates robust generalization to real data,
achieving a qualitative synthetic-to-real generalization with a J&F score of
92.2 on real data using 10 synthetic support samples. When benchmarked on the
DAVIS 2017 dataset, it achieves a J&F score of 71.5 in semi-supervised video
segmentation with three support samples. This method's fast training times and
effective generalization to real data make it a valuable tool for autonomous
systems interacting with machinery and infrastructure, and illustrate the
potential of combined and orchestrated foundation models for few-shot
segmentation tasks.
comment: Accepted at Winter Conference on Applications of Computer Vision
(WACV) 2025. Code available at
https://github.com/AIT-Assistive-Autonomous-Systems/Hopomop
☆ Robust Change Captioning in Remote Sensing: SECOND-CC Dataset and MModalCC Framework
Ali Can Karaca, M. Enes Ozelbas, Saadettin Berber, Orkhan Karimli, Turabi Yildirim, M. Fatih Amasyali
Remote sensing change captioning (RSICC) aims to describe changes between
bitemporal images in natural language. Existing methods often fail under
challenges like illumination differences, viewpoint changes, and blur effects,
leading to inaccuracies, especially in no-change regions. Moreover, images
acquired at different spatial resolutions or with registration errors tend to
degrade the captions. To address these issues, we introduce SECOND-CC, a novel
RSICC dataset featuring high-resolution RGB image pairs, semantic segmentation
maps, and diverse real-world scenarios. SECOND-CC contains 6,041 pairs of
bitemporal RS images and 30,205 sentences describing the differences between
images. Additionally, we propose MModalCC, a multimodal framework that
integrates semantic and visual data using advanced attention mechanisms,
including Cross-Modal Cross Attention (CMCA) and Multimodal Gated Cross
Attention (MGCA). Detailed ablation studies and attention visualizations
further demonstrate its effectiveness and ability to address RSICC challenges.
Comprehensive experiments show that MModalCC outperforms state-of-the-art RSICC
methods, including RSICCformer, Chg2Cap, and PSNet with +4.6% improvement on
BLEU4 score and +9.6% improvement on CIDEr score. We will make our dataset and
codebase publicly available to facilitate future research at
https://github.com/ChangeCapsInRS/SecondCC
comment: This work has been submitted to the IEEE Transactions on Geoscience
and Remote Sensing journal for possible publication
☆ SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning
Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yaochen Hu, Lingfeng Zhang, Yingxue Zhang, Shuang Wu, Tongtong Cao, Guowei Huang, Guangjian Tian, Xingyue Quan, Jianye Hao, Yuzheng Zhuang
Spatial reasoning is an essential problem in embodied AI research. Efforts to
enhance spatial reasoning abilities through supplementary spatial data and
fine-tuning have proven limited and ineffective when addressing complex
embodied tasks, largely due to their dependence on language-based outputs.
While some approaches have introduced a point-based action space to mitigate
this issue, they fall short in managing more intricate tasks within complex
environments. This deficiency arises from their failure to fully exploit the
inherent thinking and reasoning capabilities that are fundamental strengths of
Vision-Language Models (VLMs). To address these limitations, we propose a novel
approach named SpatialCoT, specifically designed to bolster the spatial
reasoning capabilities of VLMs. Our approach comprises two stages: spatial
coordinate bi-directional alignment, which aligns vision-language inputs with
spatial coordinates, and chain-of-thought spatial grounding, which harnesses
the reasoning capabilities of language models for advanced spatial reasoning.
We evaluate SpatialCoT on challenging navigation and manipulation tasks, both
in simulation and real-world settings. Experimental results demonstrate that
our method significantly outperforms previous state-of-the-art approaches in
both tasks.
comment: 13 pages, 6 figures
☆ CLIP-PCQA: Exploring Subjective-Aligned Vision-Language Modeling for Point Cloud Quality Assessment
In recent years, No-Reference Point Cloud Quality Assessment (NR-PCQA)
research has achieved significant progress. However, existing methods mostly
seek a direct mapping function from visual data to the Mean Opinion Score
(MOS), which is contradictory to the mechanism of practical subjective
evaluation. To address this, we propose a novel language-driven PCQA method
named CLIP-PCQA. Considering that human beings prefer to describe visual
quality using discrete quality descriptions (e.g., "excellent" and "poor")
rather than specific scores, we adopt a retrieval-based mapping strategy to
simulate the process of subjective assessment. More specifically, based on the
philosophy of CLIP, we calculate the cosine similarity between the visual
features and multiple textual features corresponding to different quality
descriptions, in which process an effective contrastive loss and learnable
prompts are introduced to enhance the feature extraction. Meanwhile, given the
personal limitations and bias in subjective experiments, we further convert the
feature similarities into probabilities and consider the Opinion Score
Distribution (OSD) rather than a single MOS as the final target. Experimental
results show that our CLIP-PCQA outperforms other State-Of-The-Art (SOTA)
approaches.
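The retrieval-based mapping described above follows the familiar CLIP recipe: cosine similarity between an image feature and one text feature per quality word, softened into a probability distribution over levels. The sketch below illustrates that generic recipe with random features; the quality words, temperature, MOS anchors, and expected-score readout are assumptions, not the released CLIP-PCQA model.

```python
import numpy as np

QUALITY_LEVELS = ["bad", "poor", "fair", "good", "excellent"]
LEVEL_SCORES = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # assumed MOS anchors

def quality_distribution(visual_feat, text_feats, temperature=0.07):
    """CLIP-style retrieval sketch (assumed details): cosine similarity
    between the visual feature and one text feature per quality level,
    turned into a probability distribution via a softmax."""
    v = visual_feat / np.linalg.norm(visual_feat)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = (t @ v) / temperature
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

rng = np.random.default_rng(0)
vis = rng.normal(size=512)
txt = rng.normal(size=(len(QUALITY_LEVELS), 512))
p = quality_distribution(vis, txt)
print(dict(zip(QUALITY_LEVELS, p.round(3))), "expected score:", float(p @ LEVEL_SCORES))
```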
☆ FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization
Anomaly detection methods typically require extensive normal samples from the
target class for training, limiting their applicability in scenarios that
require rapid adaptation, such as cold start. Zero-shot and few-shot anomaly
detection do not require labeled samples from the target class in advance,
making them a promising research direction. Existing zero-shot and few-shot
approaches often leverage powerful multimodal models to detect and localize
anomalies by comparing image-text similarity. However, their handcrafted
generic descriptions fail to capture the diverse range of anomalies that may
emerge in different objects, and simple patch-level image-text matching often
struggles to localize anomalous regions of varying shapes and sizes. To address
these issues, this paper proposes the FiLo++ method, which consists of two key
components. The first component, Fused Fine-Grained Descriptions (FusDes),
utilizes large language models to generate anomaly descriptions for each object
category, combines both fixed and learnable prompt templates and applies a
runtime prompt filtering method, producing more accurate and task-specific
textual descriptions. The second component, Deformable Localization (DefLoc),
integrates the vision foundation model Grounding DINO with position-enhanced
text descriptions and a Multi-scale Deformable Cross-modal Interaction (MDCI)
module, enabling accurate localization of anomalies with various shapes and
sizes. In addition, we design a position-enhanced patch matching approach to
improve few-shot anomaly detection performance. Experiments on multiple
datasets demonstrate that FiLo++ achieves significant performance improvements
compared with existing methods. Code will be available at
https://github.com/CASIA-IVA-Lab/FiLo.
☆ One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression
Current image tokenization methods require a large number of tokens to
capture the information contained within images. Although the amount of
information varies across images, most image tokenizers only support
fixed-length tokenization, leading to inefficiency in token allocation. In this
study, we introduce One-D-Piece, a discrete image tokenizer designed for
variable-length tokenization with a quality-controllable compression mechanism. To
enable variable compression rate, we introduce a simple but effective
regularization mechanism named "Tail Token Drop" into discrete one-dimensional
image tokenizers. This method encourages critical information to concentrate at
the head of the token sequence, enabling support for variable-length tokenization,
while preserving state-of-the-art reconstruction quality. We evaluate our
tokenizer across multiple reconstruction quality metrics and find that it
delivers significantly better perceptual quality than existing
quality-controllable compression methods, including JPEG and WebP, at smaller
byte sizes. Furthermore, we assess our tokenizer on various downstream computer
vision tasks, including image classification, object detection, semantic
segmentation, and depth estimation, confirming its adaptability to numerous
applications compared to other variable-rate methods. Our approach demonstrates
the versatility of variable-length discrete image tokenization, establishing a
new paradigm in both compression efficiency and reconstruction performance.
Finally, we validate the effectiveness of tail token drop via detailed analysis
of tokenizers.
comment: Our Project Page:
https://turingmotors.github.io/one-d-piece-tokenizer
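The "Tail Token Drop" regularization described above has a simple shape: during training, randomly truncate the 1D token sequence from the tail so the tokenizer learns to pack critical information at the head. The snippet below is a hedged sketch of that idea; the minimum kept length and sampling scheme are assumptions, not the paper's exact recipe.

```python
import torch

def tail_token_drop(tokens: torch.Tensor, min_keep: int = 32) -> torch.Tensor:
    """Sketch of a tail-drop regularizer (assumed details): randomly keep
    only a head-prefix of the 1D token sequence during training, pushing
    critical information toward the front of the sequence."""
    seq_len = tokens.size(1)
    keep = torch.randint(min_keep, seq_len + 1, (1,)).item()  # sampled prefix length
    return tokens[:, :keep]

tokens = torch.randint(0, 8192, (1, 256))     # 256 discrete image tokens
print(tail_token_drop(tokens).shape)          # e.g. torch.Size([1, 187])
```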
☆ LWGANet: A Lightweight Group Attention Backbone for Remote Sensing Visual Tasks
Remote sensing (RS) visual tasks have gained significant academic and
practical importance. However, they encounter numerous challenges that hinder
effective feature extraction, including the detection and recognition of
multiple objects exhibiting substantial variations in scale within a single
image. While prior dual-branch or multi-branch architectural strategies have
been effective in managing these object variances, they have concurrently
resulted in considerable increases in computational demands and parameter
counts. Consequently, these architectures are rendered less viable for
deployment on resource-constrained devices. Contemporary lightweight backbone
networks, designed primarily for natural images, frequently encounter
difficulties in effectively extracting features from multi-scale objects, which
compromises their efficacy in RS visual tasks. This article introduces LWGANet,
a specialized lightweight backbone network tailored for RS visual tasks,
incorporating a novel lightweight group attention (LWGA) module designed to
address these specific challenges. The LWGA module, tailored for RS imagery,
adeptly harnesses redundant features to extract a wide range of spatial
information, from local to global scales, without introducing additional
complexity or computational overhead. This facilitates precise feature
extraction across multiple scales within an efficient framework. LWGANet was
rigorously evaluated across twelve datasets, which span four crucial RS visual
tasks: scene classification, oriented object detection, semantic segmentation,
and change detection. The results confirm LWGANet's widespread applicability
and its ability to maintain an optimal balance between high performance and low
complexity, achieving SOTA results across diverse datasets. LWGANet emerged as
a novel solution for resource-limited scenarios requiring robust RS image
processing capabilities.
comment: 12 pages, 8 figures, Remote sensing
☆ X-Dyna: Expressive Dynamic Human Image Animation
Di Chang, Hongyi Xu, You Xie, Yipeng Gao, Zhengfei Kuang, Shengqu Cai, Chenxu Zhang, Guoxian Song, Chao Wang, Yichun Shi, Zeyuan Chen, Shijie Zhou, Linjie Luo, Gordon Wetzstein, Mohammad Soleymani
We introduce X-Dyna, a novel zero-shot, diffusion-based pipeline for
animating a single human image using facial expressions and body movements
derived from a driving video, generating realistic, context-aware dynamics
for both the subject and the surrounding environment. Building on prior
approaches centered on human pose control, X-Dyna addresses key shortcomings
causing the loss of dynamic details, enhancing the lifelike qualities of human
video animations. At the core of our approach is the Dynamics-Adapter, a
lightweight module that effectively integrates reference appearance context
into the spatial attentions of the diffusion backbone while preserving the
capacity of motion modules in synthesizing fluid and intricate dynamic details.
Beyond body pose control, we connect a local control module with our model to
capture identity-disentangled facial expressions, facilitating accurate
expression transfer for enhanced realism in animated scenes. Together, these
components form a unified framework capable of learning physical human motion
and natural scene dynamics from a diverse blend of human and scene videos.
Comprehensive qualitative and quantitative evaluations demonstrate that X-Dyna
outperforms state-of-the-art methods, creating highly lifelike and expressive
animations. The code is available at https://github.com/bytedance/X-Dyna.
comment: Project page: https://x-dyna.github.io/xdyna.github.io/
Code: https://github.com/bytedance/X-Dyna
☆ Textoon: Generating Vivid 2D Cartoon Characters from Text Descriptions
The 2D cartoon style is a prominent art form in digital character creation,
particularly popular among younger audiences. While advancements in digital
human technology have spurred extensive research into photorealistic digital
humans and 3D characters, interactive 2D cartoon characters have received
comparatively less attention. Unlike 3D counterparts, which require
sophisticated construction and resource-intensive rendering, Live2D, a
widely-used format for 2D cartoon characters, offers a more efficient
alternative that allows 2D characters to be animated in a manner that simulates
3D movement without the need to build a complete 3D model. Furthermore,
Live2D employs lightweight HTML5 (H5) rendering, improving both accessibility
and efficiency. In this technical report, we introduce Textoon, an innovative
method for generating diverse 2D cartoon characters in the Live2D format based
on text descriptions. Textoon leverages cutting-edge language and vision
models to comprehend textual intentions and generate 2D appearance, capable of
creating a wide variety of stunning and interactive 2D characters within one
minute. The project homepage is https://human3daigc.github.io/Textoon_webpage/.
☆ DiffuEraser: A Diffusion Model for Video Inpainting
Recent video inpainting algorithms integrate flow-based pixel propagation
with transformer-based generation to leverage optical flow for restoring
textures and objects using information from neighboring frames, while
completing masked regions through visual Transformers. However, these
approaches often encounter blurring and temporal inconsistencies when dealing
with large masks, highlighting the need for models with enhanced generative
capabilities. Recently, diffusion models have emerged as a prominent technique
in image and video generation due to their impressive performance. In this
paper, we introduce DiffuEraser, a video inpainting model based on stable
diffusion, designed to fill masked regions with greater details and more
coherent structures. We incorporate prior information to provide initialization
and weak conditioning,which helps mitigate noisy artifacts and suppress
hallucinations. Additionally, to improve temporal consistency during
long-sequence inference, we expand the temporal receptive fields of both the
prior model and DiffuEraser, and further enhance consistency by leveraging the
temporal smoothing property of Video Diffusion Models. Experimental results
demonstrate that our proposed method outperforms state-of-the-art techniques in
both content completeness and temporal consistency while maintaining acceptable
efficiency.
comment: 11 pages, 13 figures
☆ Mitigating Hallucinations on Object Attributes using Multiview Images and Negative Instructions ICASSP 2025
Current popular Large Vision-Language Models (LVLMs) are suffering from
Hallucinations on Object Attributes (HoOA), leading to incorrect determination
of fine-grained attributes in the input images. Leveraging significant
advancements in 3D generation from a single image, this paper proposes a novel
method to mitigate HoOA in LVLMs. This method utilizes multiview images sampled
from generated 3D representations as visual prompts for LVLMs, thereby
providing more visual information from other viewpoints. Furthermore, we
observe the input order of multiple multiview images significantly affects the
performance of LVLMs. Consequently, we have devised Multiview Image Augmented
VLM (MIAVLM), incorporating a Multiview Attributes Perceiver (MAP) submodule
capable of simultaneously eliminating the influence of input image order and
aligning visual information from multiview images with Large Language Models
(LLMs). Besides, we designed and employed negative instructions to mitigate
LVLMs' bias towards ``Yes" responses. Comprehensive experiments demonstrate the
effectiveness of our method.
comment: 2025 IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP 2025)
☆ Deep Learning for Early Alzheimer Disease Detection with MRI Scans
Alzheimer's Disease is a neurodegenerative condition characterized by
dementia and impairment in neurological function. The study primarily focuses
on individuals above age 40, in whom the disease affects memory, behavior, and
cognitive processes. Alzheimer's disease requires diagnosis by a
detailed assessment of MRI scans and neuropsychological tests of the patients.
This project compares existing deep learning models in the pursuit of enhancing
the accuracy and efficiency of AD diagnosis, specifically focusing on the
Convolutional Neural Network, Bayesian Convolutional Neural Network, and the
U-net model with the Open Access Series of Imaging Studies brain MRI dataset.
Besides, to ensure robustness and reliability in the model evaluations, we
address the challenge of imbalance in data. We then perform rigorous evaluation
to determine strengths and weaknesses for each model by considering
sensitivity, specificity, and computational efficiency. This comparative
analysis not only sheds light on the future role of AI in revolutionizing AD
diagnostics but also paves the way for future innovation in medical imaging and
the management of neurodegenerative diseases.
☆ Multi-Modal Attention Networks for Enhanced Segmentation and Depth Estimation of Subsurface Defects in Pulse Thermography
AI-driven pulse thermography (PT) has become a crucial tool in
non-destructive testing (NDT), enabling automatic detection of hidden anomalies
in various industrial components. Current state-of-the-art techniques feed
segmentation and depth estimation networks compressed PT sequences using either
Principal Component Analysis (PCA) or Thermographic Signal Reconstruction
(TSR). However, treating these two modalities independently constrains the
performance of PT inspection models as these representations possess
complementary semantic features. To address this limitation, this work proposes
PT-Fusion, a multi-modal attention-based fusion network that fuses both PCA and
TSR modalities for defect segmentation and depth estimation of subsurface
defects in PT setups. PT-Fusion introduces novel feature fusion modules,
Encoder Attention Fusion Gate (EAFG) and Attention Enhanced Decoding Block
(AEDB), to fuse PCA and TSR features for enhanced segmentation and depth
estimation of subsurface defects. In addition, a novel data augmentation
technique is proposed based on random data sampling from thermographic
sequences to alleviate the scarcity of PT datasets. The proposed method is
benchmarked against state-of-the-art PT inspection models, including U-Net,
attention U-Net, and 3D-CNN on the Université Laval IRT-PVC dataset. The
results demonstrate that PT-Fusion outperforms the aforementioned models in
defect segmentation and depth estimation accuracies with a margin of 10%.
comment: Pulse thermography, infrared thermography, defect segmentation,
multi-modal networks, attention mechanism
☆ RichSpace: Enriching Text-to-Video Prompt Space via Text Embedding Interpolation
Text-to-video generation models have made impressive progress, but they still
struggle with generating videos with complex features. This limitation often
arises from the inability of the text encoder to produce accurate embeddings,
which hinders the video generation model. In this work, we propose a novel
approach to overcome this challenge by selecting the optimal text embedding
through interpolation in the embedding space. We demonstrate that this method
enables the video generation model to produce the desired videos. Additionally,
we introduce a simple algorithm using perpendicular foot embeddings and cosine
similarity to identify the optimal interpolation embedding. Our findings
highlight the importance of accurate text embeddings and offer a pathway for
improving text-to-video generation performance.
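The abstract mentions a simple selection algorithm built from perpendicular foot embeddings and cosine similarity. As a loose illustration only, the sketch below computes a perpendicular foot onto the segment between two embeddings and picks the linear interpolation most similar to a reference embedding; what the reference is and how the two pieces combine in the paper are not specified in the abstract, so all of this is an assumption.

```python
import numpy as np

def perpendicular_foot(a, b, r):
    """Foot of the perpendicular from reference embedding r onto the
    segment between embeddings a and b (clamped to the segment)."""
    ab = b - a
    t = np.clip(np.dot(r - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return a + t * ab

def best_interpolation(a, b, r, steps=11):
    """Hedged sketch (not the paper's exact procedure): scan linear
    interpolations between two text embeddings and keep the candidate
    with the highest cosine similarity to a reference embedding r."""
    cands = [(1 - t) * a + t * b for t in np.linspace(0, 1, steps)]
    cos = [np.dot(c, r) / (np.linalg.norm(c) * np.linalg.norm(r)) for c in cands]
    return cands[int(np.argmax(cos))]

rng = np.random.default_rng(0)
a, b, r = rng.normal(size=768), rng.normal(size=768), rng.normal(size=768)
print(perpendicular_foot(a, b, r).shape, best_interpolation(a, b, r).shape)
```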
☆ Aneumo: A Large-Scale Comprehensive Synthetic Dataset of Aneurysm Hemodynamics
Xigui Li, Yuanye Zhou, Feiyang Xiao, Xin Guo, Yichi Zhang, Chen Jiang, Jianchao Ge, Xiansheng Wang, Qimeng Wang, Taiwei Zhang, Chensen Lin, Yuan Cheng, Yuan Qi
Intracranial aneurysm (IA) is a common cerebrovascular disease that is
usually asymptomatic but may cause severe subarachnoid hemorrhage (SAH) if
ruptured. Although clinical practice is usually based on individual factors and
morphological features of the aneurysm, its pathophysiology and hemodynamic
mechanisms remain controversial. To address the limitations of current
research, this study constructed a comprehensive hemodynamic dataset of
intracranial aneurysms. The dataset is based on 466 real aneurysm models, and
10,000 synthetic models were generated by resection and deformation operations,
including 466 aneurysm-free models and 9,534 deformed aneurysm models. The
dataset also provides medical image-like segmentation mask files to support
insightful analysis. In addition, the dataset contains hemodynamic data
measured at eight steady-state flow rates (0.001 to 0.004 kg/s), including
critical parameters such as flow velocity, pressure, and wall shear stress,
providing a valuable resource for investigating aneurysm pathogenesis and
clinical prediction. This dataset will help advance the understanding of the
pathologic features and hemodynamic mechanisms of intracranial aneurysms and
support in-depth research in related fields. Dataset hosted at
https://github.com/Xigui-Li/Aneumo.
☆ GaussianAvatar-Editor: Photorealistic Animatable Gaussian Head Avatar Editor 3DV 2025
We introduce GaussianAvatar-Editor, an innovative framework for text-driven
editing of animatable Gaussian head avatars that can be fully controlled in
expression, pose, and viewpoint. Unlike static 3D Gaussian editing, editing
animatable 4D Gaussian avatars presents challenges related to motion occlusion
and spatial-temporal inconsistency. To address these issues, we propose the
Weighted Alpha Blending Equation (WABE). This function enhances the blending
weight of visible Gaussians while suppressing the influence on non-visible
Gaussians, effectively handling motion occlusion during editing. Furthermore,
to improve editing quality and ensure 4D consistency, we incorporate
conditional adversarial learning into the editing process. This strategy helps
to refine the edited results and maintain consistency throughout the animation.
By integrating these methods, our GaussianAvatar-Editor achieves photorealistic
and consistent results in animatable 4D Gaussian editing. We conduct
comprehensive experiments across various subjects to validate the effectiveness
of our proposed techniques, which demonstrates the superiority of our approach
over existing methods. More results and code are available at: [Project
Link](https://xiangyueliu.github.io/GaussianAvatar-Editor/).
comment: Accepted to 3DV 2025. [Project
Link](https://xiangyueliu.github.io/GaussianAvatar-Editor/)
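The Weighted Alpha Blending Equation (WABE) above is only described qualitatively; a hedged PyTorch sketch of how such occlusion-aware reweighting could enter standard front-to-back alpha compositing is given below. The binary visibility input and the suppress factor are assumptions, not the paper's actual formulation.

```python
import torch

def weighted_alpha_blend(colors, alphas, visibility, suppress=0.1):
    """Occlusion-aware alpha compositing along one ray (front to back).

    colors:     (N, 3) per-Gaussian colors
    alphas:     (N,)   per-Gaussian opacities in [0, 1]
    visibility: (N,)   1.0 for visible Gaussians, 0.0 for occluded ones
    """
    # Hypothetical reweighting: keep visible Gaussians, damp occluded ones.
    w = alphas * torch.where(visibility > 0.5,
                             torch.ones_like(alphas),
                             torch.full_like(alphas, suppress))
    # Standard transmittance-based compositing with the reweighted opacities.
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - w[:-1]]), dim=0)
    weights = w * trans
    return (weights.unsqueeze(-1) * colors).sum(dim=0)
```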
☆ Explainable artificial intelligence (XAI): from inherent explainability to large language models
Artificial Intelligence (AI) has continued to achieve tremendous success in
recent times. However, the decision logic of these frameworks is often not
transparent, making it difficult for stakeholders to understand, interpret or
explain their behavior. This limitation hinders trust in machine learning
systems and causes a general reluctance towards their adoption in practical
applications, particularly in mission-critical domains like healthcare and
autonomous driving. Explainable AI (XAI) techniques facilitate the
explainability or interpretability of machine learning models, enabling users
to discern the basis of the decision and possibly avert undesirable behavior.
This comprehensive survey details the advancements of explainable AI methods,
from inherently interpretable models to modern approaches for achieving
interpretability of various black box models, including large language models
(LLMs). Additionally, we review explainable AI techniques that leverage LLM and
vision-language model (VLM) frameworks to automate or improve the
explainability of other machine learning models. The use of LLM and VLM as
interpretability methods particularly enables high-level, semantically
meaningful explanations of model decisions and behavior. Throughout the paper,
we highlight the scientific principles, strengths and weaknesses of
state-of-the-art methods and outline different areas of improvement. Where
appropriate, we also present qualitative and quantitative comparisons of
various methods. Finally, we discuss the key
challenges of XAI and directions for future research.
☆ Discrete Prior-based Temporal-coherent Content Prediction for Blind Face Video Restoration
Blind face video restoration aims to restore high-fidelity details from
videos subjected to complex and unknown degradations. This task poses a
significant challenge of managing temporal heterogeneity while at the same time
maintaining stable face attributes. In this paper, we introduce a Discrete
Prior-based Temporal-Coherent content prediction transformer to address the
challenge, and our model is referred to as DP-TempCoh. Specifically, we
incorporate a spatial-temporal-aware content prediction module to synthesize
high-quality content from discrete visual priors, conditioned on degraded video
tokens. To further enhance the temporal coherence of the predicted content, a
motion statistics modulation module is designed to adjust the content, based on
discrete motion priors in terms of cross-frame mean and variance. As a result,
the statistics of the predicted content can match those of real videos over
time. By performing extensive experiments, we verify the effectiveness of the
design elements and demonstrate the superior performance of our DP-TempCoh in
both synthetically and naturally degraded video restoration.
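The motion statistics modulation module is characterized only by cross-frame mean and variance; the sketch below shows one hedged interpretation in PyTorch, renormalizing the temporal statistics of the predicted content to match prior statistics (an AdaIN-style transform). Tensor shapes and names are assumptions.

```python
import torch

def motion_statistics_modulation(content, prior_mean, prior_var, eps=1e-5):
    # content: (T, C, H, W) predicted content features over time
    # prior_mean, prior_var: (C,) cross-frame statistics from discrete motion priors
    mean = content.mean(dim=(0, 2, 3), keepdim=True)
    var = content.var(dim=(0, 2, 3), keepdim=True)
    normalized = (content - mean) / torch.sqrt(var + eps)
    return normalized * torch.sqrt(prior_var.view(1, -1, 1, 1) + eps) \
        + prior_mean.view(1, -1, 1, 1)
```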
☆ Surface-SOS: Self-Supervised Object Segmentation via Neural Surface Representation
Self-supervised Object Segmentation (SOS) aims to segment objects without any
annotations. With multi-camera inputs, the structural, textural, and geometric
consistency across views can be leveraged to achieve
fine-grained object segmentation. To make better use of the above information,
we propose Surface representation based Self-supervised Object Segmentation
(Surface-SOS), a new framework to segment objects for each view by 3D surface
representation from multi-view images of a scene. To model high-quality
geometry surfaces for complex scenes, we design a novel scene representation
scheme, which decomposes the scene into two complementary neural representation
modules, each with a Signed Distance Function (SDF). Moreover,
Surface-SOS is able to refine single-view segmentation with multi-view
unlabeled images, by introducing coarse segmentation masks as additional input.
To the best of our knowledge, Surface-SOS is the first self-supervised approach
that leverages neural surface representation to break the dependence on large
amounts of annotated data and strong constraints. These constraints typically
involve observing target objects against a static background or relying on
temporal supervision in videos. Extensive experiments on standard benchmarks
including LLFF, CO3D, BlendedMVS, TUM and several real-world scenes show that
Surface-SOS always yields finer object masks than its NeRF-based counterparts
and substantially surpasses supervised single-view baselines. Code is available
at: https://github.com/zhengxyun/Surface-SOS.
comment: Accepted by TIP
☆ A Multi-Scale Feature Extraction and Fusion Deep Learning Method for Classification of Wheat Diseases
Wheat is an important source of dietary fiber and protein whose growth is
negatively impacted by a number of risks. The difficulty of identifying and
classifying wheat diseases is discussed, with an emphasis on wheat loose smut,
leaf rust, and crown and root rot. Addressing conditions like crown and root
rot, this study introduces an innovative approach that integrates multi-scale
feature extraction with advanced image segmentation techniques to enhance
classification accuracy. The proposed method trains the neural network models
Xception, Inception V3, and ResNet 50 on a large wheat disease classification
dataset (2020), in conjunction with an ensemble of machine vision classifiers
based on voting and stacking. The study shows that the suggested methodology
achieves a superior accuracy of 99.75% in the classification of wheat diseases
compared to current state-of-the-art approaches. Among the deep learning
models, Xception showed the highest accuracy.
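For illustration, the voting and stacking ensembles named above can be sketched as follows, assuming per-backbone softmax probabilities have already been computed for Xception, Inception V3, and ResNet 50; the logistic-regression meta-learner is an illustrative choice rather than the paper's stated one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def soft_voting(probs_list):
    # probs_list: list of (N, num_classes) softmax outputs, one per backbone.
    return np.argmax(np.mean(probs_list, axis=0), axis=1)

def stacking(train_probs_list, y_train, test_probs_list):
    # A meta-classifier learns how to combine the backbones' probabilities.
    X_train = np.concatenate(train_probs_list, axis=1)
    X_test = np.concatenate(test_probs_list, axis=1)
    meta = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return meta.predict(X_test)
```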
☆ Physics-informed DeepCT: Sinogram Wavelet Decomposition Meets Masked Diffusion
Diffusion models show remarkable potential on sparse-view computed tomography
(SVCT) reconstruction. However, when a network is trained on a limited sample
space, its generalization capability may be constrained, which degrades
performance on unfamiliar data. For image generation tasks, this can lead to
issues such as blurry details and inconsistencies between regions. To alleviate
this problem, we propose a Sinogram-based Wavelet random decomposition And
Random mask diffusion Model (SWARM) for SVCT reconstruction. Specifically,
introducing a random mask strategy in the sinogram effectively expands the
limited training sample space. This enables the model to learn a broader range
of data distributions, enhancing its understanding and generalization of data
uncertainty. In addition, applying a random training strategy to the
high-frequency components of the sinogram wavelet enhances feature
representation and improves the ability to capture details in different
frequency bands, thereby improving performance and robustness. A two-stage
iterative reconstruction method is adopted to ensure the global consistency of
the reconstructed image while refining its details. Experimental results
demonstrate that SWARM outperforms competing approaches in both quantitative
and qualitative performance across various datasets.
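As a hedged illustration of the two ingredients described above, random masking of the sinogram and a wavelet decomposition whose high-frequency sub-bands are the ones perturbed during training, the sketch below uses NumPy and PyWavelets; the mask ratio and wavelet choice are assumptions.

```python
import numpy as np
import pywt

def random_mask_sinogram(sinogram, mask_ratio=0.3, rng=None):
    # Randomly zero out a fraction of sinogram entries to expand the
    # effective training sample space (illustrative masking strategy).
    rng = rng or np.random.default_rng()
    mask = rng.random(sinogram.shape) > mask_ratio
    return sinogram * mask

def wavelet_split(sinogram, wavelet="haar"):
    # One-level 2D DWT: returns the low-frequency band and the three
    # high-frequency sub-bands (LH, HL, HH) targeted by the random training.
    low, (lh, hl, hh) = pywt.dwt2(sinogram, wavelet)
    return low, (lh, hl, hh)
```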
☆ IE-Bench: Advancing the Measurement of Text-Driven Image Editing for Human Perception Alignment
Recent advances in text-driven image editing have been significant, yet the
task of accurately evaluating these edited images continues to pose a
considerable challenge. Different from the assessment of text-driven image
generation, text-driven image editing is characterized by simultaneously
conditioning on both text and a source image. The edited images often retain an
intrinsic connection to the original image, which changes dynamically with the
semantics of the text. However, previous methods tend to focus solely on
text-image alignment or are not aligned with human perception. In this work,
we introduce the Text-driven Image Editing Benchmark suite (IE-Bench) to
enhance the assessment of text-driven edited images. IE-Bench includes a
database containing diverse source images, various editing prompts, the
corresponding results of different editing methods, and a total of 3,010 Mean
Opinion Scores (MOS) provided by 25 human subjects. Furthermore, we introduce IE-QA, a
multi-modality source-aware quality assessment method for text-driven image
editing. To the best of our knowledge, IE-Bench offers the first IQA dataset
and model tailored for text-driven image editing. Extensive experiments
demonstrate IE-QA's superior alignment with human judgments on the text-driven image
editing task compared with previous metrics. We will make all related data and
code available to the public.
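The abstract does not describe IE-QA's internals, but agreement with Mean Opinion Scores is conventionally reported with rank and linear correlation; a small sketch of that evaluation step (not of IE-QA itself) is shown below.

```python
from scipy.stats import spearmanr, pearsonr

def evaluate_against_mos(predicted_scores, mos):
    # Standard IQA-style agreement metrics between predicted quality scores
    # and human Mean Opinion Scores.
    srcc, _ = spearmanr(predicted_scores, mos)
    plcc, _ = pearsonr(predicted_scores, mos)
    return {"SRCC": float(srcc), "PLCC": float(plcc)}
```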
☆ ForestProtector: An IoT Architecture Integrating Machine Vision and Deep Reinforcement Learning for Efficient Wildfire Monitoring
Kenneth Bonilla-Ormachea, Horacio Cuizaga, Edwin Salcedo, Sebastian Castro, Sergio Fernandez-Testa, Misael Mamani
Early detection of forest fires is crucial to minimizing the environmental
and socioeconomic damage they cause. Indeed, a fire's duration directly
correlates with the difficulty and cost of extinguishing it. For instance, a
fire burning for 1 minute might require 1 liter of water to extinguish, while a
2-minute fire could demand 100 liters, and a 10-minute fire might necessitate
1,000 liters. On the other hand, existing fire detection systems based on novel
technologies (e.g., remote sensing, PTZ cameras, UAVs) are often expensive and
require human intervention, making continuous monitoring of large areas
impractical. To address this challenge, this work proposes a low-cost forest
fire detection system that utilizes a central gateway device with computer
vision capabilities to monitor a 360° field of view for smoke at long
distances. A deep reinforcement learning agent enhances surveillance by
dynamically controlling the camera's orientation, leveraging real-time sensor
data (smoke levels, ambient temperature, and humidity) from distributed IoT
devices. This approach enables automated wildfire monitoring across expansive
areas while reducing false positives.
comment: Accepted for publication in the proceedings of the 11th International
Conference on Automation, Robotics, and Applications (ICARA 2025)
☆ TalkingEyes: Pluralistic Speech-Driven 3D Eye Gaze Animation
Although significant progress has been made in the field of speech-driven 3D
facial animation recently, the speech-driven animation of an indispensable
facial component, eye gaze, has been overlooked by recent research. This is
primarily due to the weak correlation between speech and eye gaze, as well as
the scarcity of audio-gaze data, making it very challenging to generate 3D eye
gaze motion from speech alone. In this paper, we propose a novel data-driven
method which can generate diverse 3D eye gaze motions in harmony with the
speech. To achieve this, we first construct an audio-gaze dataset that
contains about 14 hours of audio-mesh sequences featuring high-quality eye gaze
motion, head motion and facial motion simultaneously. The motion data is
acquired by performing lightweight eye gaze fitting and face reconstruction on
videos from existing audio-visual datasets. We then tailor a novel
speech-to-motion translation framework in which the head motions and eye gaze
motions are jointly generated from speech but are modeled in two separate
latent spaces. This design stems from the physiological knowledge that the
rotation range of the eyeballs is smaller than that of the head. By mapping the
speech embedding into the two latent spaces, the difficulty in modeling the
weak correlation between speech and non-verbal motion is thus attenuated.
Finally, our TalkingEyes, integrated with a speech-driven 3D facial motion
generator, can synthesize eye gaze motion, eye blinks, head motion and facial
motion collectively from speech. Extensive quantitative and qualitative
evaluations demonstrate the superiority of the proposed method in generating
diverse and natural 3D eye gaze motions from speech. The project page of this
paper is: https://lkjkjoiuiu.github.io/TalkingEyes_Home/
☆ SLIM: Sim-to-Real Legged Instructive Manipulation via Long-Horizon Visuomotor Learning
We present a low-cost quadruped manipulation system that solves long-horizon
real-world tasks, trained by reinforcement learning purely in simulation. The
system comprises 1) a hierarchical design of a high-level policy for
visual-mobile manipulation following instructions, and a low-level policy for
quadruped movement and limb-control, 2) a progressive policy expansion approach
for solving the long-horizon task, together with a teacher-student framework for
efficient training of the high-level visuomotor policy, and 3) a
suite of techniques for minimizing sim-to-real gaps.
With budget-friendly hardware of limited reliability and performance, and
just one wrist-mounted RGB camera, the entire system fully trained in
simulation achieves high success rates for long horizon tasks involving search,
move, grasp, and drop-into, with fluid sim-to-real transfer in a wide variety
of indoor and outdoor scenes and lighting conditions. Extensive real-world
evaluations show that, on long-horizon mobile manipulation tasks, our system
achieves good performance when transferred to the real world, both in terms of task
success rate and execution efficiency. Finally, we discuss the necessity of our
sim-to-real techniques for legged mobile manipulation, and show their ablation
performance.
☆ FoundationStereo: Zero-Shot Stereo Matching
Tremendous progress has been made in deep stereo matching to excel on
benchmark datasets through per-domain fine-tuning. However, achieving strong
zero-shot generalization - a hallmark of foundation models in other computer
vision tasks - remains challenging for stereo matching. We introduce
FoundationStereo, a foundation model for stereo depth estimation designed to
achieve strong zero-shot generalization. To this end, we first construct a
large-scale (1M stereo pairs) synthetic training dataset featuring large
diversity and high photorealism, followed by an automatic self-curation
pipeline to remove ambiguous samples. We then design a number of network
architecture components to enhance scalability, including a side-tuning feature
backbone that adapts rich monocular priors from vision foundation models to
mitigate the sim-to-real gap, and long-range context reasoning for effective
cost volume filtering. Together, these components lead to strong robustness and
accuracy across domains, establishing a new standard in zero-shot stereo depth
estimation.
☆ FLORA: Formal Language Model Enables Robust Training-free Zero-shot Object Referring Analysis
Object Referring Analysis (ORA), commonly known as referring expression
comprehension, requires the identification and localization of specific objects
in an image based on natural descriptions. Unlike generic object detection, ORA
requires both accurate language understanding and precise visual localization,
making it inherently more complex. Although recent pre-trained large visual
grounding detectors have achieved significant progress, they heavily rely on
extensively labeled data and time-consuming learning. To address these issues, we
introduce a novel, training-free framework for zero-shot ORA, termed FLORA
(Formal Language for Object Referring and Analysis). FLORA harnesses the
inherent reasoning capabilities of large language models (LLMs) and integrates
a formal language model - a logical framework that regulates language within
structured, rule-based descriptions - to provide effective zero-shot ORA. More
specifically, our formal language model (FLM) enables an effective,
logic-driven interpretation of object descriptions without necessitating any
training processes. Built upon FLM-regulated LLM outputs, we further devise a
Bayesian inference framework and employ appropriate off-the-shelf interpretive
models to finalize the reasoning, delivering favorable robustness against LLM
hallucinations and compelling ORA performance in a training-free manner. In
practice, our FLORA boosts the zero-shot performance of existing pretrained
grounding detectors by up to around 45%. Our comprehensive evaluation across
different challenging datasets also confirms that FLORA consistently surpasses
current state-of-the-art zero-shot methods in both detection and segmentation
tasks associated with zero-shot ORA. We believe our probabilistic parsing and
reasoning of the LLM outputs elevate the reliability and interpretability of
zero-shot ORA. We shall release codes upon publication.
♻ ☆ MVTamperBench: Evaluating Robustness of Vision-Language Models
Amit Agarwal, Srikant Panda, Angeline Charles, Bhargava Kumar, Hitesh Patel, Priyaranjan Pattnayak, Taki Hasan Rafi, Tejaswini Kumar, Dong-Kyu Chae
Multimodal Large Language Models (MLLMs) have driven major advances in video
understanding, yet their vulnerability to adversarial tampering and
manipulations remains underexplored. To address this gap, we introduce
MVTamperBench, a benchmark that systematically evaluates MLLM robustness
against five prevalent tampering techniques: rotation, masking, substitution,
repetition, and dropping. Built from 3.4K original videos, expanded to over 17K
tampered clips spanning 19 video tasks, MVTamperBench challenges models to
detect manipulations in spatial and
temporal coherence. We evaluate 45 recent MLLMs from 15+ model families,
revealing substantial variability in resilience across tampering types and
showing that larger parameter counts do not necessarily guarantee robustness.
MVTamperBench sets a new benchmark for developing tamper-resilient MLLMs in
safety-critical applications, including detecting clickbait, preventing harmful
content distribution, and enforcing policies on media platforms. We release all
code and data to foster open research in trustworthy video understanding.
Code: https://amitbcp.github.io/MVTamperBench/ Data:
https://huggingface.co/datasets/Srikant86/MVTamperBench
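For concreteness, hedged NumPy versions of the five tampering types named in the abstract are sketched below; the benchmark's actual parameters (rotation angles, segment boundaries) are not given, so these functions are illustrative rather than the released implementation.

```python
import numpy as np

def rotate(clip, k=1):
    # clip: (T, H, W, C) video frames; rotate every frame by k * 90 degrees.
    return np.rot90(clip, k=k, axes=(1, 2))

def mask_segment(clip, start, end):
    out = clip.copy()
    out[start:end] = 0          # black out a temporal segment
    return out

def substitute_segment(clip, donor, start, end):
    out = clip.copy()
    out[start:end] = donor[start:end]   # splice frames from another video
    return out

def repeat_segment(clip, start, end):
    # duplicate a segment immediately after itself
    return np.concatenate([clip[:end], clip[start:end], clip[end:]], axis=0)

def drop_segment(clip, start, end):
    return np.concatenate([clip[:start], clip[end:]], axis=0)
```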
♻ ☆ Mesh2SLAM in VR: A Fast Geometry-Based SLAM Framework for Rapid Prototyping in Virtual Reality Applications
SLAM is a foundational technique with broad applications in robotics and
AR/VR. SLAM simulations evaluate new concepts, but testing on
resource-constrained devices, such as VR HMDs, faces challenges: high
computational cost and restricted sensor data access. This work proposes a
sparse framework using mesh geometry projections as features, which improves
efficiency and circumvents direct sensor data access, advancing SLAM research
as we demonstrate in VR and through numerical evaluation.
♻ ☆ ESVO2: Direct Visual-Inertial Odometry with Stereo Event Cameras
Event-based visual odometry is a specific branch of visual Simultaneous
Localization and Mapping (SLAM) techniques, which aims at solving tracking and
mapping subproblems (typically in parallel), by exploiting the special working
principles of neuromorphic (i.e., event-based) cameras. Due to the
motion-dependent nature of event data, explicit data association (i.e., feature
matching) under large-baseline view-point changes is difficult to establish,
making direct methods a more rational choice. However, state-of-the-art direct
methods are limited by the high computational complexity of the mapping
sub-problem and the degeneracy of camera pose tracking in certain degrees of
freedom (DoF) in rotation. In this paper, we tackle these issues by building an
event-based stereo visual-inertial odometry system on top of a direct pipeline.
Specifically, to speed up the mapping operation, we propose an efficient
strategy for sampling contour points according to the local dynamics of events.
The mapping performance is also improved in terms of structure completeness and
local smoothness by merging the temporal stereo and static stereo results. To
circumvent the degeneracy of camera pose tracking in recovering the pitch and
yaw components of general 6-DoF motion, we introduce IMU measurements as motion
priors via pre-integration. To this end, a compact back-end is proposed for
continuously updating the IMU bias and predicting the linear velocity, enabling
an accurate motion prediction for camera pose tracking. The resulting system
scales well with modern high-resolution event cameras and leads to better
global positioning accuracy in large-scale outdoor environments. Extensive
evaluations on five publicly available datasets featuring different resolutions
and scenarios justify the superior performance of the proposed system against
five state-of-the-art methods.
♻ ☆ BILTS: A Bi-Invariant Similarity Measure for Robust Object Trajectory Recognition under Reference Frame Variations
When similar object motions are performed in diverse contexts but are meant
to be recognized under a single classification, these contextual variations act
as disturbances that negatively affect accurate motion recognition. In this
paper, we focus on contextual variations caused by reference frame variations.
To robustly deal with these variations, similarity measures have been
introduced that compare object motion trajectories in a context-invariant
manner. However, most are highly sensitive to noise near singularities, where
the measure is not uniquely defined, and lack bi-invariance (invariance to both
world and body frame variations). To address these issues, we propose the novel
\textit{Bi-Invariant Local Trajectory-Shape Similarity} (BILTS) measure.
Compared to other measures, the BILTS measure uniquely offers bi-invariance,
boundedness, and third-order shape identity. Aimed at practical
implementations, we devised a discretized and regularized version of the BILTS
measure which shows exceptional robustness to singularities. This is
demonstrated through rigorous recognition experiments using multiple datasets.
On average, BILTS attained the highest recognition ratio and least sensitivity
to contextual variations compared to other invariant object motion similarity
measures. We believe that the BILTS measure is a valuable tool for recognizing
motions performed in diverse contexts and has potential in other applications,
including the recognition, segmentation, and adaptation of both motion and
force trajectories.
comment: This work has been submitted as a regular research paper for
consideration in the Journal of Intelligent & Robotic Systems. The content in
this preprint is identical to the version submitted for peer review, except
for formatting differences required by the journal
♻ ☆ Bridging Diversity and Uncertainty in Active learning with Self-Supervised Pre-Training ICLR 2024
This study addresses the integration of diversity-based and uncertainty-based
sampling strategies in active learning, particularly within the context of
self-supervised pre-trained models. We introduce a straightforward heuristic
called TCM that mitigates the cold start problem while maintaining strong
performance across various data levels. By initially applying TypiClust for
diversity sampling and subsequently transitioning to uncertainty sampling with
Margin, our approach effectively combines the strengths of both strategies. Our
experiments demonstrate that TCM consistently outperforms existing methods
across various datasets in both low and high data regimes.
comment: Accepted at ICLR 2024 Workshop on Practical Machine Learning for Low
Resource Settings (PML4LRS)
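A minimal sketch of the heuristic as described, diversity-first selection followed by margin-based uncertainty sampling, is given below; the switch point and the assumption of a precomputed TypiClust ordering are illustrative simplifications, not the paper's exact protocol.

```python
import numpy as np

def margin_uncertainty(probs):
    # Margin = gap between the top-2 class probabilities; smaller = more uncertain.
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def tcm_select(unlabeled_probs, diversity_order, n_labeled, budget, switch_at=100):
    # While the labeled pool is small, follow a diversity ordering
    # (e.g., from TypiClust); afterwards, pick the smallest-margin samples.
    if n_labeled < switch_at:
        return diversity_order[:budget]
    margins = margin_uncertainty(unlabeled_probs)
    return np.argsort(margins)[:budget]
```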
♻ ☆ Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models
We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder
models for accelerating high-resolution diffusion models. Existing autoencoder
models have demonstrated impressive results at a moderate spatial compression
ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for
high spatial compression ratios (e.g., 64x). We address this challenge by
introducing two key techniques: (1) Residual Autoencoding, where we design our
models to learn residuals based on the space-to-channel transformed features to
alleviate the optimization difficulty of high spatial-compression autoencoders;
(2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phase
training strategy for mitigating the generalization penalty of high
spatial-compression autoencoders. With these designs, we improve the
autoencoder's spatial compression ratio up to 128 while maintaining the
reconstruction quality. Applying our DC-AE to latent diffusion models, we
achieve significant speedup without accuracy drop. For example, on ImageNet
512x512, our DC-AE provides 19.1x inference speedup and 17.9x training speedup
on H100 GPU for UViT-H while achieving a better FID, compared with the widely
used SD-VAE-f8 autoencoder. Our code is available at
https://github.com/mit-han-lab/efficientvit.
comment: Preprint. First two authors contributed equally to this work. Update:
fix typo
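Residual Autoencoding is described as learning residuals on top of space-to-channel transformed features; one hedged PyTorch sketch of such a downsampling block, using pixel-unshuffle as the non-parametric shortcut, is shown below. The channel sizes and the 1x1 projection are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualDownBlock(nn.Module):
    # The learned branch only has to predict a residual on top of a
    # space-to-channel (pixel-unshuffle) shortcut, which is meant to ease
    # optimization at high spatial compression ratios.
    def __init__(self, c_in, c_out, factor=2):
        super().__init__()
        self.factor = factor
        self.proj = nn.Conv2d(c_in * factor * factor, c_out, kernel_size=1)
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, stride=factor, padding=1)

    def forward(self, x):
        shortcut = self.proj(F.pixel_unshuffle(x, self.factor))
        return self.conv(x) + shortcut
```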
♻ ☆ Generate E-commerce Product Background by Integrating Category Commonality and Personalized Style ICASSP 2025
The state-of-the-art methods for e-commerce product background generation
suffer from the inefficiency of designing product-wise prompts when scaling up
the production, as well as the ineffectiveness of describing fine-grained
styles when customizing personalized backgrounds for some specific brands. To
address these obstacles, we integrate the category commonality and personalized
style into diffusion models. Concretely, we propose a Category-Wise Generator
to enable large-scale background generation with only one model for the first
time. A unique identifier in the prompt is assigned to each category, whose
attention is located on the background by a mask-guided cross attention layer
to learn the category-wise style. Furthermore, for products with specific and
fine-grained requirements in layout, elements, etc., a Personality-Wise
Generator is devised to learn such personalized style directly from a reference
image to resolve textual ambiguities, and is trained in a self-supervised
manner for more efficient training data usage. To advance research in this
field, the first large-scale e-commerce product background generation dataset
BG60k is constructed, which covers more than 60k product images from over 2k
categories. Experiments demonstrate that our method could generate high-quality
backgrounds for different categories, and maintain the personalized background
style of reference images. BG60k will be available at
\url{https://github.com/Whileherham/BG60k}.
comment: Accepted by ICASSP 2025
♻ ☆ LayerAnimate: Layer-specific Control for Animation
Animated video separates foreground and background elements into layers, with
distinct processes for sketching, refining, coloring, and in-betweening.
Existing video generation methods typically treat animation as a monolithic
data domain, lacking fine-grained control over individual layers. In this
paper, we introduce LayerAnimate, a novel architectural approach that enhances
fine-grained control over individual animation layers within a video diffusion
model, allowing users to independently manipulate foreground and background
elements in distinct layers. To address the challenge of limited layer-specific
data, we propose a data curation pipeline that features automated element
segmentation, motion-state hierarchical merging, and motion coherence
refinement. Through quantitative and qualitative comparisons, and user study,
we demonstrate that LayerAnimate outperforms current methods in terms of
animation quality, control precision, and usability, making it an ideal tool
for both professional animators and amateur enthusiasts. This framework opens
up new possibilities for layer-specific animation applications and creative
flexibility. Our code is available at https://layeranimate.github.io.
comment: Project page: https://layeranimate.github.io
♻ ☆ A Survey on Deep Learning for Polyp Segmentation: Techniques, Challenges and Future Trends
Early detection and assessment of polyps play a crucial role in the
prevention and treatment of colorectal cancer (CRC). Polyp segmentation
provides an effective solution to assist clinicians in accurately locating and
segmenting polyp regions. Early methods often relied on manually extracted
low-level features such as color, texture, and shape, which often struggled to
capture global context and lacked robustness in complex scenarios.
With the advent of deep learning, more and more outstanding medical image
segmentation algorithms based on deep learning networks have emerged, making
significant progress in this field. This paper provides a comprehensive review
of polyp segmentation algorithms. We first review some traditional algorithms
based on manually extracted features and deep segmentation algorithms, then
detail benchmark datasets related to the topic. Specifically, we carry out a
comprehensive evaluation of recent deep learning models and results based on
polyp sizes, considering the pain points of research topics and differences in
network structures. Finally, we discuss the challenges of polyp segmentation
and future trends in this field. The models, benchmark datasets, and source
code links we collected are all published at
https://github.com/taozh2017/Awesome-Polyp-Segmentation.
comment: Published in Visual Intelligence
♻ ☆ Isolated Diffusion: Optimizing Multi-Concept Text-to-Image Generation Training-Freely with Isolated Diffusion Guidance
Large-scale text-to-image diffusion models have achieved great success in
synthesizing high-quality and diverse images given target text prompts. Despite
the revolutionary image generation ability, current state-of-the-art models
still struggle to deal with multi-concept generation accurately in many cases.
This phenomenon is known as "concept bleeding" and manifests as the unexpected
overlapping or merging of various concepts. This paper presents a general
approach for text-to-image diffusion models to address the mutual interference
between different subjects and their attachments in complex scenes, pursuing
better text-image consistency. The core idea is to isolate the synthesizing
processes of different concepts. We propose to bind each attachment to
corresponding subjects separately with split text prompts. Besides, we
introduce a revision method to fix the concept bleeding problem in
multi-subject synthesis. We first depend on pre-trained object detection and
segmentation models to obtain the layouts of subjects. Then we isolate and
resynthesize each subject individually with corresponding text prompts to avoid
mutual interference. Overall, we achieve a training-free strategy, named
Isolated Diffusion, to optimize multi-concept text-to-image synthesis. It is
compatible with the latest Stable Diffusion XL (SDXL) and prior Stable
Diffusion (SD) models. We compare our approach with alternative methods using a
variety of multi-concept text prompts and demonstrate its effectiveness with
clear advantages in text-image consistency and in a user study.
comment: Accepted by IEEE Transactions on Visualization and Computer Graphics
♻ ☆ Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation
Audio-guided Video Object Segmentation (A-VOS) and Referring Video Object
Segmentation (R-VOS) are two highly related tasks that both aim to segment
specific objects from video sequences according to expression prompts. However,
due to the challenges of modeling representations for different modalities,
existing methods struggle to strike a balance between interaction flexibility
and localization precision. In this paper, we address this problem from two
perspectives: the alignment of audio and text and the deep interaction among
audio, text, and visual modalities. First, we propose a universal architecture,
the Expression Prompt Collaboration Transformer, herein EPCFormer. Next, we
propose an Expression Alignment (EA) mechanism for audio and text. The proposed
EPCFormer exploits the fact that audio and text prompts referring to the same
objects are semantically equivalent by using contrastive learning for both
types of expressions. Then, to facilitate deep interactions among audio, text,
and visual modalities, we introduce an Expression-Visual Attention (EVA)
module. The knowledge of video object segmentation in terms of the expression
prompts can seamlessly transfer between the two tasks by deeply exploring
complementary cues between text and audio. Experiments on well-recognized
benchmarks demonstrate that our EPCFormer attains state-of-the-art results on
both tasks. The source code will be made publicly available at
https://github.com/lab206/EPCFormer.
comment: Accepted to Knowledge-Based Systems (KBS). The source code will be
made publicly available at https://github.com/lab206/EPCFormer
♻ ☆ Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding
We introduce Tarsier2, a state-of-the-art large vision-language model (LVLM)
designed for generating detailed and accurate video descriptions, while also
exhibiting superior general video understanding capabilities. Tarsier2 achieves
significant advancements through three key upgrades: (1) Scaling pre-training
data from 11M to 40M video-text pairs, enriching both volume and diversity; (2)
Performing fine-grained temporal alignment during supervised fine-tuning; (3)
Using model-based sampling to automatically construct preference data and
applying DPO training for optimization. Extensive experiments show that
Tarsier2-7B consistently outperforms leading proprietary models, including
GPT-4o and Gemini 1.5 Pro, in detailed video description tasks. On the DREAM-1K
benchmark, Tarsier2-7B improves F1 by 2.8\% over GPT-4o and 5.8\% over
Gemini-1.5-Pro. In human side-by-side evaluations, Tarsier2-7B shows a +8.6\%
performance advantage over GPT-4o and +24.9\% over Gemini-1.5-Pro. Tarsier2-7B
also sets new state-of-the-art results across 15 public benchmarks, spanning
tasks such as video question-answering, video grounding, hallucination test,
and embodied question-answering, demonstrating its versatility as a robust
generalist vision-language model.
♻ ☆ Continuous Urban Change Detection from Satellite Image Time Series with Temporal Feature Refinement and Multi-Task Integration
Urbanization advances at unprecedented rates, resulting in negative effects
on the environment and human well-being. Remote sensing has the potential to
mitigate these effects by supporting sustainable development strategies with
accurate information on urban growth. Deep learning-based methods have achieved
promising urban change detection results from optical satellite image pairs
using convolutional neural networks (ConvNets), transformers, and a multi-task
learning setup. However, transformers have not been leveraged for urban change
detection with multi-temporal data, i.e., >2 images, and multi-task learning
methods lack integration approaches that combine change and segmentation
outputs. To fill this research gap, we propose a continuous urban change
detection method that identifies changes in each consecutive image pair of a
satellite image time series (SITS). Specifically, we propose a temporal feature
refinement (TFR) module that utilizes self-attention to improve ConvNet-based
multi-temporal building representations. Furthermore, we propose a multi-task
integration (MTI) module that utilizes Markov networks to find an optimal
building map time series based on segmentation and dense change outputs. The
proposed method effectively identifies urban changes based on high-resolution
SITS acquired by the PlanetScope constellation (F1 score 0.551) and Gaofen-2
(F1 score 0.440). Moreover, our experiments on two challenging datasets
demonstrate the effectiveness of the proposed method compared to bi-temporal
and multi-temporal urban change detection and segmentation methods.
comment: Under review at IEEE Transactions on Geoscience and Remote Sensing,
Code will be available at https://github.com/SebastianHafner/ContUrbanCD.git
♻ ☆ Mamba2D: A Natively Multi-Dimensional State-Space Model for Vision Tasks
Enis Baty, Alejandro Hernández Díaz, Chris Bridges, Rebecca Davidson, Steve Eckersley, Simon Hadfield
State-Space Models (SSMs) have recently emerged as a powerful and efficient
alternative to the long-standing transformer architecture. However, existing
SSM conceptualizations retain deeply rooted biases from their origins in natural
language processing. This constrains their ability to appropriately model the
spatially-dependent characteristics of visual inputs. In this paper, we address
these limitations by re-deriving modern selective state-space techniques,
starting from a natively multidimensional formulation. To date, prior works
attempt to apply natively 1D SSMs to 2D data (i.e. images) by relying on
arbitrary combinations of 1D scan directions to capture spatial dependencies.
In contrast, Mamba2D improves upon this with a single 2D scan direction that
factors in both dimensions of the input natively, effectively modelling spatial
dependencies when constructing hidden states. Mamba2D shows comparable
performance to prior adaptations of SSMs for vision tasks, on standard image
classification evaluations with the ImageNet-1K dataset. Source code is
available at https://github.com/cocoalex00/Mamba2D.
♻ ☆ Model Synthesis for Zero-Shot Model Attribution
Nowadays, generative models are shaping various fields such as art, design,
and human-computer interaction, yet accompanied by challenges related to
copyright infringement and content management. In response, existing research
seeks to identify the unique fingerprints on the images they generate, which
can be leveraged to attribute the generated images to their source models.
Existing methods, however, are constrained to identifying models within a
static set included in the classifier training, failing to adapt to newly
emerged unseen models dynamically. To bridge this gap, we aim to develop a
generalized model fingerprint extractor capable of zero-shot attribution,
effectively attributing unseen models without exposure to them during training. Central
to our method is a model synthesis technique, which generates numerous
synthetic models mimicking the fingerprint patterns of real-world generative
models. The design of the synthesis technique is motivated by observations on
how the basic generative model's architecture building blocks and parameters
influence fingerprint patterns, and it is validated through two designed
metrics that examine synthetic models' fidelity and diversity. Our experiments
demonstrate that this fingerprint extractor, trained solely on synthetic
models, achieves impressive zero-shot generalization on a wide range of
real-world generative models, improving model identification and verification
accuracy on unseen models by over 40% and 15%, respectively, compared to
existing approaches.
comment: under review
♻ ☆ Multi-stage Deep Learning Artifact Reduction for Parallel-beam Computed Tomography
Computed Tomography (CT) using synchrotron radiation is a powerful technique
that, compared to lab-CT techniques, offers high spatial and temporal
resolution while also providing access to a range of contrast-formation
mechanisms. The acquired projection data is typically processed by a
computational pipeline composed of multiple stages. Artifacts introduced during
data acquisition can propagate through the pipeline, and degrade image quality
in the reconstructed images. Recently, deep learning has shown significant
promise in enhancing image quality for images representing scientific data.
This success has driven increasing adoption of deep learning techniques in CT
imaging. Various approaches have been proposed to incorporate deep learning
into computational pipelines, but each has limitations in synchrotron CT, either
in properly addressing the specific artifacts or in computational efficiency.
Recognizing these challenges, we introduce a novel method that incorporates
separate deep learning models at each stage of the tomography
pipeline-projection, sinogram, and reconstruction-to address specific artifacts
locally in a data-driven way. Our approach includes bypass connections that
feed both the outputs from previous stages and raw data to subsequent stages,
minimizing the risk of error propagation. Extensive evaluations on both
simulated and real-world datasets illustrate that our approach effectively
reduces artifacts and outperforms comparison methods.
♻ ☆ IncSAR: A Dual Fusion Incremental Learning Framework for SAR Target Recognition
Deep learning techniques have achieved significant success in Synthetic
Aperture Radar (SAR) target recognition using predefined datasets in static
scenarios. However, real-world applications demand that models incrementally
learn new information without forgetting previously acquired knowledge. The
challenge of catastrophic forgetting, where models lose past knowledge when
adapting to new tasks, remains a critical issue. In this paper, we introduce
IncSAR, an incremental learning framework designed to tackle catastrophic
forgetting in SAR target recognition. IncSAR combines the power of a Vision
Transformer (ViT) and a custom-designed Convolutional Neural Network (CNN) in a
dual-branch architecture, integrated via a late-fusion strategy. Additionally,
we explore the use of TinyViT to reduce computational complexity and propose an
attention mechanism to dynamically enhance feature representation. To mitigate
the speckle noise inherent in SAR images, we employ a denoising module based on
a neural network approximation of Robust Principal Component Analysis (RPCA),
leveraging a simple neural network for efficient noise reduction in SAR
imagery. Moreover, a random projection layer improves the linear separability
of features, and a variant of Linear Discriminant Analysis (LDA) decorrelates
extracted class prototypes for better generalization. Extensive experiments on
the MSTAR, SAR-AIRcraft-1.0, and OpenSARShip benchmark datasets demonstrate
that IncSAR significantly outperforms state-of-the-art approaches, achieving a
99.63\% average accuracy and a 0.33\% performance drop, representing an 89\%
improvement in retention compared to existing techniques. The source code is
available at https://github.com/geokarant/IncSAR.
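The dual-branch late fusion described above can be sketched in a hedged form: features from the ViT branch and the CNN branch are concatenated and classified by a shared head. The feature dimensions below are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    def __init__(self, vit_dim=384, cnn_dim=256, num_classes=10):
        super().__init__()
        self.classifier = nn.Linear(vit_dim + cnn_dim, num_classes)

    def forward(self, vit_feat, cnn_feat):
        # Concatenate branch features and classify (late fusion).
        return self.classifier(torch.cat([vit_feat, cnn_feat], dim=-1))
```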
♻ ☆ VLSBench: Unveiling Visual Leakage in Multimodal Safety
Safety concerns of Multimodal large language models (MLLMs) have gradually
become an important problem in various applications. Surprisingly, previous
works indicate a counter-intuitive phenomenon that using textual unlearning to
align MLLMs achieves comparable safety performances with MLLMs trained with
image-text pairs. To explain such a counter-intuitive phenomenon, we discover a
visual safety information leakage (VSIL) problem in existing multimodal safety
benchmarks, i.e., the potentially risky and sensitive content in the image has
been revealed in the textual query. In this way, MLLMs can easily refuse these
sensitive text-image queries according to textual queries. However, image-text
pairs without VSIL are common in real-world scenarios and are overlooked by
existing multimodal safety benchmarks. To this end, we construct a multimodal
visual leakless safety benchmark (VLSBench) with 2.4k image-text pairs, which
prevents visual safety leakage from the image to the textual query. Experimental results
indicate that VLSBench poses a significant challenge to both open-source and
closed-source MLLMs, including LLaVA, Qwen2-VL, Llama3.2-Vision, and GPT-4o.
This study demonstrates that textual alignment is enough for multimodal safety
scenarios with VSIL, while multimodal alignment is a more promising solution
for multimodal safety scenarios without VSIL. Please see our code and data at:
https://hxhcreate.github.io/vlsbench.github.io/
♻ ☆ SARATR-X: Towards Building A Foundation Model for SAR Target Recognition
Despite the remarkable progress in synthetic aperture radar automatic target
recognition (SAR ATR), recent efforts have concentrated on detecting and
classifying a specific category, e.g., vehicles, ships, airplanes, or
buildings. One of the fundamental limitations of the top-performing SAR ATR
methods is that the learning paradigm is supervised, task-specific,
limited-category, closed-world learning, which depends on massive amounts of
accurately annotated samples that are expensively labeled by expert SAR
analysts and have limited generalization capability and scalability. In this
work, we make the first attempt towards building a foundation model for SAR
ATR, termed SARATR-X. SARATR-X learns generalizable representations via
self-supervised learning (SSL) and provides a cornerstone for label-efficient
model adaptation to generic SAR target detection and classification tasks.
Specifically, SARATR-X is trained on 0.18 M unlabelled SAR target samples,
which are curated by combining contemporary benchmarks and constitute the
largest publicly available dataset till now. Considering the characteristics of
SAR images, a backbone tailored for SAR ATR is carefully designed, and a
two-step SSL method endowed with multi-scale gradient features is applied to
ensure the feature diversity and model scalability of SARATR-X. The
capabilities of SARATR-X are evaluated on classification under few-shot and
robustness settings and detection across various categories and scenes, and
impressive performance is achieved, often competitive with or even superior to
prior fully supervised, semi-supervised, or self-supervised algorithms. Our
SARATR-X and the curated dataset are released at
https://github.com/waterdisappear/SARATR-X to foster research into foundation
models for SAR image interpretation.
comment: 20 pages, 9 figures
♻ ☆ Mitigating analytical variability in fMRI results with style transfer
We propose a novel approach to improve the reproducibility of neuroimaging
results by converting statistic maps across different functional MRI pipelines.
We make the assumption that pipelines used to compute fMRI statistic maps can
be considered as a style component and we propose to use different generative
models, among which, Generative Adversarial Networks (GAN) and Diffusion Models
(DM) to convert statistic maps across different pipelines. We explore the
performance of multiple GAN frameworks, and design a new DM framework for
unsupervised multi-domain style transfer. We constrain the generation of 3D fMRI
statistic maps using the latent space of an auxiliary classifier that
distinguishes statistic maps from different pipelines and extend traditional
sampling techniques used in DM to improve the transition performance. Our
experiments demonstrate that our proposed methods are successful: pipelines can
indeed be transferred as a style component, providing an important source of
data augmentation for future medical studies.
♻ ☆ Accelerating lensed quasars discovery and modeling with physics-informed variational autoencoders
Irham T. Andika, Stefan Schuldt, Sherry H. Suyu, Satadru Bag, Raoul Cañameras, Alejandra Melo, Claudio Grillo, James H. H. Chan
Strongly lensed quasars provide valuable insights into the rate of cosmic
expansion, the distribution of dark matter in foreground deflectors, and the
characteristics of quasar hosts. However, detecting them in astronomical images
is difficult due to the prevalence of non-lensing objects. To address this
challenge, we developed a generative deep learning model called VariLens, built
upon a physics-informed variational autoencoder. This model seamlessly
integrates three essential modules: image reconstruction, object
classification, and lens modeling, offering a fast and comprehensive approach
to strong lens analysis. VariLens is capable of rapidly determining both (1)
the probability that an object is a lens system and (2) key parameters of a
singular isothermal ellipsoid (SIE) mass model -- including the Einstein radius
($\theta_\mathrm{E}$), lens center, and ellipticity -- in just milliseconds
using a single CPU. A direct comparison of VariLens estimates with traditional
lens modeling for 20 known lensed quasars within the Subaru Hyper Suprime-Cam
(HSC) footprint shows good agreement, with both results consistent within
$2\sigma$ for systems with $\theta_\mathrm{E}<3$ arcsecs. To identify new
lensed quasar candidates, we begin with an initial sample of approximately 80
million sources, combining HSC data with multiwavelength information from
various surveys. After applying a photometric preselection aimed at locating
$z>1.5$ sources, the number of candidates is reduced to 710,966. Subsequently,
VariLens highlights 13,831 sources, each showing a high likelihood of being a
lens. A visual assessment of these objects results in 42 promising candidates
that await spectroscopic confirmation. These results underscore the potential
of automated deep learning pipelines to efficiently detect and model strong
lenses in large datasets.
comment: Submitted to the Astronomy & Astrophysics journal and updated to
reflect the revised version. The paper consists of 17 main pages, 14 figures,
and 5 tables. We welcome feedback and comments from readers!
♻ ☆ WaveDH: Wavelet Sub-bands Guided ConvNet for Efficient Image Dehazing
The surge in interest regarding image dehazing has led to notable
advancements in deep learning-based single image dehazing approaches,
exhibiting impressive performance in recent studies. Despite these strides,
many existing methods fall short in meeting the efficiency demands of practical
applications. In this paper, we introduce WaveDH, a novel and compact ConvNet
designed to address this efficiency gap in image dehazing. Our WaveDH leverages
wavelet sub-bands for guided up-and-downsampling and frequency-aware feature
refinement. The key idea lies in utilizing wavelet decomposition to extract
low- and high-frequency components from feature levels, allowing for faster
processing while upholding high-quality reconstruction. The downsampling block
employs a novel squeeze-and-attention scheme to optimize the feature
downsampling process in a structurally compact manner through wavelet domain
learning, preserving discriminative features while discarding noise components.
In our upsampling block, we introduce a dual-upsample and fusion mechanism to
enhance high-frequency component awareness, aiding in the reconstruction of
high-frequency details. Departing from conventional dehazing methods that treat
low- and high-frequency components equally, our feature refinement block
strategically processes features with a frequency-aware approach. By employing
a coarse-to-fine methodology, it not only refines the details at frequency
levels but also significantly optimizes computational costs. The refinement is
performed in a maximum 8x downsampled feature space, striking a favorable
efficiency-vs-accuracy trade-off. Extensive experiments demonstrate that our
method, WaveDH, outperforms many state-of-the-art methods on several image
dehazing benchmarks with significantly reduced computational costs. Our code is
available at https://github.com/AwesomeHwang/WaveDH.
comment: Under Review
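A minimal sketch of wavelet-guided downsampling, splitting each feature channel into one low-frequency band and three high-frequency bands with a one-level Haar DWT, is shown below using PyWavelets; WaveDH's actual blocks additionally learn squeeze-and-attention weights, which are omitted here.

```python
import numpy as np
import pywt
import torch

def wavelet_downsample(feat, wavelet="haar"):
    # feat: (C, H, W) feature map on CPU. The Haar DWT halves the spatial
    # resolution and yields per-channel low- and high-frequency sub-bands,
    # which frequency-aware blocks can then process separately.
    low_bands, high_bands = [], []
    for channel in feat:
        ll, (lh, hl, hh) = pywt.dwt2(channel.numpy(), wavelet)
        low_bands.append(ll)
        high_bands.append(np.stack([lh, hl, hh]))
    low = torch.from_numpy(np.stack(low_bands))      # (C, H/2, W/2)
    high = torch.from_numpy(np.stack(high_bands))    # (C, 3, H/2, W/2)
    return low, high
```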
♻ ☆ Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific
person images according to the given textual descriptions. A primary challenge
in this task is bridging the substantial representational gap between visual
and textual modalities. The prevailing methods map texts and images into
unified embedding space for matching, while the intricate semantic
correspondences between texts and images are still not effectively constructed.
To address this issue, we propose a novel TIPR framework to build fine-grained
interactions and alignment between person images and the corresponding texts.
Specifically, via fine-tuning the Contrastive Language-Image Pre-training
(CLIP) model, a visual-textual dual encoder is first constructed to
preliminarily align the image and text features. Secondly, a Text-guided Image
Restoration (TIR) auxiliary task is proposed to map abstract textual entities
to specific image regions, improving the alignment between local textual and
visual embeddings. Additionally, a cross-modal triplet loss is presented to
handle hard samples, and further enhance the model's discriminability for minor
differences. Moreover, a pruning-based text data augmentation approach is
proposed to enhance focus on essential elements in descriptions, thereby
avoiding excessive model attention to less significant information. The
experimental results show our proposed method outperforms state-of-the-art
methods on three popular benchmark datasets, and the code will be made publicly
available at https://github.com/Delong-liu-bupt/SEN.
comment: The paper was withdrawn due to a dispute among the authors regarding
the content of the article
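The cross-modal triplet loss for hard samples is not spelled out in the abstract; a common hinge-based formulation with in-batch hardest negatives, assuming L2-normalized image and text embeddings of matched pairs, could be sketched as follows (the margin value is an assumption).

```python
import torch
import torch.nn.functional as F

def cross_modal_triplet_loss(img_emb, txt_emb, margin=0.2):
    # img_emb, txt_emb: (B, D) L2-normalized embeddings of matched pairs.
    sim = img_emb @ txt_emb.t()                  # (B, B) cosine similarities
    pos = sim.diag()                             # similarities of matched pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hardest_txt = sim.masked_fill(mask, -1.0).max(dim=1).values  # per image
    hardest_img = sim.masked_fill(mask, -1.0).max(dim=0).values  # per text
    loss_i2t = F.relu(margin + hardest_txt - pos)
    loss_t2i = F.relu(margin + hardest_img - pos)
    return (loss_i2t + loss_t2i).mean()
```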
♻ ☆ Text-guided Synthetic Geometric Augmentation for Zero-shot 3D Understanding
Kohei Torimi, Ryosuke Yamada, Daichi Otsuka, Kensho Hara, Yuki M. Asano, Hirokatsu Kataoka, Yoshimitsu Aoki
Zero-shot recognition models require extensive training data for
generalization. However, in zero-shot 3D classification, collecting 3D data and
captions is costly and labor-intensive, posing a significant barrier compared to
2D vision. Recent advances in generative models have achieved unprecedented
realism in synthetic data production, and recent research shows the potential
for using generated data as training data. This naturally raises the question:
can synthetic 3D data generated by generative models be used to expand limited
3D datasets? In response, we present a synthetic 3D dataset
expansion method, Text-guided Geometric Augmentation (TeGA). TeGA is tailored
for language-image-3D pretraining, which achieves SoTA in zero-shot 3D
classification, and uses a generative text-to-3D model to enhance and extend
limited 3D datasets. Specifically, we automatically generate text-guided
synthetic 3D data and introduce a consistency filtering strategy to discard
noisy samples where semantics and geometric shapes do not match with text. In
the experiment to double the original dataset size using TeGA, our approach
demonstrates improvements over the baselines, achieving zero-shot performance
gains of 3.0% on Objaverse-LVIS, 4.6% on ScanObjectNN, and 8.7% on ModelNet40.
These results demonstrate that TeGA effectively bridges the 3D data gap,
enabling robust zero-shot 3D classification even with limited real training
data and paving the way for zero-shot 3D vision applications.
♻ ☆ SuperNeRF-GAN: A Universal 3D-Consistent Super-Resolution Framework for Efficient and Enhanced 3D-Aware Image Synthesis
Neural volume rendering techniques, such as NeRF, have revolutionized
3D-aware image synthesis by enabling the generation of images of a single scene
or object from various camera poses. However, the high computational cost of
NeRF presents challenges for synthesizing high-resolution (HR) images. Most
existing methods address this issue by leveraging 2D super-resolution, which
compromises 3D-consistency. Other methods propose radiance manifolds or
two-stage generation to achieve 3D-consistent HR synthesis, yet they are
limited to specific synthesis tasks, reducing their universality. To tackle
these challenges, we propose SuperNeRF-GAN, a universal framework for
3D-consistent super-resolution. A key highlight of SuperNeRF-GAN is its
seamless integration with NeRF-based 3D-aware image synthesis methods: it can
simultaneously enhance the resolution of generated images while preserving
3D-consistency and reducing computational cost. Specifically, given a
pre-trained generator capable of producing a NeRF representation such as
tri-plane, we first perform volume rendering to obtain a low-resolution image
with corresponding depth and normal map. Then, we employ a NeRF
Super-Resolution module which learns a network to obtain a high-resolution
NeRF. Next, we propose a novel Depth-Guided Rendering process which contains
three simple yet effective steps, including the construction of a
boundary-correct multi-depth map through depth aggregation, a normal-guided
depth super-resolution and a depth-guided NeRF rendering. Experimental results
demonstrate the superior efficiency, 3D-consistency, and quality of our
approach. Additionally, ablation studies confirm the effectiveness of our
proposed components.
♻ ☆ DX2CT: Diffusion Model for 3D CT Reconstruction from Bi or Mono-planar 2D X-ray(s)
Computed tomography (CT) provides high-resolution medical imaging, but
it can expose patients to high radiation. X-ray scanners have low radiation
exposure, but their resolutions are low. This paper proposes a new conditional
diffusion model, DX2CT, that reconstructs three-dimensional (3D) CT volumes
from bi or mono-planar X-ray image(s). The proposed DX2CT consists of two key
components: 1) modulating feature maps extracted from two-dimensional (2D)
X-ray(s) with 3D positions of CT volume using a new transformer and 2)
effectively using the modulated 3D position-aware feature maps as conditions of
DX2CT. In particular, the proposed transformer can provide conditions with rich
information of a target CT slice to the conditional diffusion model, enabling
high-quality CT reconstruction. Our experiments with the bi or mono-planar
X-ray(s) benchmark datasets show that the proposed DX2CT outperforms several
state-of-the-art methods. Our codes and model will be available at:
https://www.github.com/intyeger/DX2CT.
♻ ☆ MoRe: Class Patch Attention Needs Regularization for Weakly Supervised Semantic Segmentation AAAI 2025
Weakly Supervised Semantic Segmentation (WSSS) with image-level labels
typically uses Class Activation Maps (CAM) to achieve dense predictions.
Recently, Vision Transformer (ViT) has provided an alternative to generate
localization maps from class-patch attention. However, due to insufficient
constraints on modeling such attention, we observe that the Localization
Attention Maps (LAM) often struggle with the artifact issue, i.e., patch
regions with minimal semantic relevance are falsely activated by class tokens.
In this work, we propose MoRe to address this issue and further explore the
potential of LAM. Our findings suggest that imposing additional regularization
on class-patch attention is necessary. To this end, we first view the attention
as a novel directed graph and propose the Graph Category Representation module
to implicitly regularize the interaction among class-patch entities. It ensures
that class tokens dynamically condense the related patch information and
suppress unrelated artifacts at a graph level. Second, motivated by the
observation that CAM from classification weights maintains smooth localization
of objects, we devise the Localization-informed Regularization module to
explicitly regularize the class-patch attention. It directly mines the token
relations from CAM and further supervises the consistency between class and
patch tokens in a learnable manner. Extensive experiments are conducted on
PASCAL VOC and MS COCO, validating that MoRe effectively addresses the artifact
issue and achieves state-of-the-art performance, surpassing recent single-stage
and even multi-stage methods. Code is available at
https://github.com/zwyang6/MoRe.
comment: AAAI 2025
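To make the localization-informed regularization above concrete, a minimal sketch of one plausible form is shown below: the class-patch attention is pushed toward a distribution derived from the CAM. The KL form, the temperature, and the tensor shapes are assumptions for illustration, not MoRe's exact loss.

    import torch
    import torch.nn.functional as F

    def localization_informed_reg(class_patch_attn, cam, temperature=1.0):
        """Toy localization-informed regularizer for class-patch attention.

        class_patch_attn: (B, C, N) attention from C class tokens to N patch tokens.
        cam:              (B, C, H, W) class activation maps with H*W == N.
        Supervising the attention with a CAM-derived distribution follows the
        abstract's idea; the KL form and temperature are assumptions.
        """
        cam_flat = cam.flatten(2)                                  # (B, C, N)
        target = F.softmax(cam_flat / temperature, dim=-1)         # CAM as a distribution
        log_attn = F.log_softmax(class_patch_attn / temperature, dim=-1)
        # KL(target || attn): push attention toward the smoother CAM localization.
        return F.kl_div(log_attn, target, reduction="batchmean")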
♻ ☆ Elucidating the Design Space of Dataset Condensation NeurIPS 2024
Dataset condensation, a concept within data-centric learning, efficiently
transfers critical attributes from an original dataset to a synthetic version,
maintaining both diversity and realism. This approach significantly improves
model training efficiency and is adaptable across multiple application areas.
Previous methods in dataset condensation have faced challenges: some incur high
computational costs which limit scalability to larger datasets (e.g., MTT,
DREAM, and TESLA), while others are restricted to less optimal design spaces,
which could hinder potential improvements, especially in smaller datasets
(e.g., SRe2L, G-VBSM, and RDED). To address these limitations, we propose a
comprehensive design framework that includes specific, effective strategies
like implementing soft category-aware matching and adjusting the learning rate
schedule. These strategies are grounded in empirical evidence and theoretical
backing. Our resulting approach, Elucidate Dataset Condensation (EDC),
establishes a benchmark for both small and large-scale dataset condensation. In
our testing, EDC achieves state-of-the-art accuracy, reaching 48.6% on
ImageNet-1k with a ResNet-18 model at an IPC of 10, which corresponds to a
compression ratio of 0.78%. This performance exceeds those of SRe2L, G-VBSM,
and RDED by margins of 27.3%, 17.2%, and 6.6%, respectively.
comment: Accepted by NeurIPS 2024
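The reported 0.78% compression ratio follows directly from the condensed-set size: 10 images per class over 1,000 ImageNet-1k classes versus the full training set. A quick check (the train-set size of 1,281,167 images is assumed from the standard ImageNet-1k split):

    classes = 1000
    ipc = 10                      # images per class in the condensed set
    train_images = 1_281_167      # standard ImageNet-1k training-set size (assumed)

    condensed = classes * ipc
    ratio = condensed / train_images
    print(f"{condensed} condensed images -> compression ratio {ratio:.2%}")  # ~0.78%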
♻ ☆ Harnessing small projectors and multiple views for efficient vision pretraining NeurIPS 2024
Recent progress in self-supervised learning (SSL) for visual representations
has led to several proposed frameworks that rely on
augmentations of images but use different loss functions. However, there are
few theoretically grounded principles to guide practice, so practical
implementation of each SSL framework requires several heuristics to achieve
competitive performance. In this work, we build on recent analytical results to
design practical recommendations for competitive and efficient SSL that are
grounded in theory. Specifically, recent theory tells us that existing SSL
frameworks are minimizing the same idealized loss, which is to learn features
that best match the data similarity kernel defined by the augmentations used.
We show how this idealized loss can be reformulated to a functionally
equivalent loss that is more efficient to compute. We study the implicit bias
of using gradient descent to minimize our reformulated loss function and find
that using a stronger orthogonalization constraint with a reduced projector
dimensionality should yield good representations. Furthermore, the theory tells
us that approximating the reformulated loss should be improved by increasing
the number of augmentations, and as such using multiple augmentations should
lead to improved convergence. We empirically verify our findings on the CIFAR, STL
and ImageNet datasets, wherein we demonstrate an improved linear readout
performance when training a ResNet-backbone using our theoretically grounded
recommendations. Remarkably, we also demonstrate that by leveraging these
insights, we can reduce the pretraining dataset size by up to 2x while
maintaining downstream accuracy simply by using more data augmentations. Taken
together, our work provides theoretically grounded recommendations that can be
used to improve SSL convergence and efficiency.
comment: Accepted to NeurIPS 2024
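A minimal sketch of the two practical recommendations above, multiple augmentations per image and a small projector kept close to orthogonal, is given below. The pairwise alignment term and the explicit orthogonality penalty are illustrative assumptions, not the paper's exact reformulated loss.

    import torch
    import torch.nn.functional as F

    def multiview_ssl_loss(backbone, projector, views, ortho_weight=1.0):
        """Toy SSL objective with many augmentations and a near-orthogonal projector.

        views: tensor of shape (M, B, C, H, W) -- M augmentations of the same B images.
        projector: assumed to be an nn.Sequential whose last layer is nn.Linear.
        """
        M, B = views.shape[0], views.shape[1]
        z = projector(backbone(views.flatten(0, 1)))          # (M*B, d), small d
        z = F.normalize(z, dim=1).view(M, B, -1)

        # Alignment: all pairs of views of the same image should embed close together.
        align = 0.0
        for i in range(M):
            for j in range(i + 1, M):
                align = align + (z[i] - z[j]).pow(2).sum(dim=1).mean()
        align = align / (M * (M - 1) / 2)

        # Encourage (approximately) orthogonal rows in the final projector layer.
        W = projector[-1].weight
        eye = torch.eye(W.shape[0], device=W.device)
        ortho = (W @ W.t() - eye).pow(2).mean()

        return align + ortho_weight * ortho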
♻ ☆ OPCap: Object-aware Prompting Captioning
In the field of image captioning, the phenomenon where missing or nonexistent
objects are used to explain an image is referred to as object bias (or
hallucination). To mitigate this issue, we propose OPCap, an object-aware
prompting strategy. This method first extracts object labels and their spatial
information from the image using an object detector. Then, an attribute
predictor further refines the semantic features of the objects. These refined
features are subsequently integrated and fed into the decoder, enhancing the
model's understanding of the image context. Experimental results on the COCO
and nocaps datasets demonstrate that OPCap effectively mitigates hallucination
and significantly improves the quality of generated captions.
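One simple way to realize the object-aware prompting described above is to serialize detector outputs into a text prefix for the caption decoder; the sketch below does exactly that. The prompt format, the detection dictionary layout, and the optional attribute predictor are assumptions, not OPCap's actual interface.

    def build_object_prompt(detections, attribute_predictor=None):
        """Serialize detector outputs into a text prefix for a caption decoder.

        detections: list of dicts like {"label": "dog", "box": (x1, y1, x2, y2)}.
        attribute_predictor: optional callable mapping a detection to a short
        attribute string (e.g. "brown, running"); purely illustrative here.
        """
        parts = []
        for det in detections:
            x1, y1, x2, y2 = det["box"]
            desc = det["label"]
            if attribute_predictor is not None:
                desc = f"{attribute_predictor(det)} {desc}"
            parts.append(f"{desc} at ({x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f})")
        return "Objects: " + "; ".join(parts) + ". Caption:"

    # Example output: "Objects: brown dog at (12,30,140,200). Caption:"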
♻ ☆ Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving AAAI2025
World models envision potential future states based on various ego actions.
They embed extensive knowledge about the driving environment, facilitating safe
and scalable autonomous driving. Most existing methods primarily focus on
either data generation or the pretraining paradigms of world models. Unlike the
aforementioned prior works, we propose Drive-OccWorld, which adapts a
vision-centric 4D forecasting world model to end-to-end planning for autonomous
driving. Specifically, we first introduce a semantic and motion-conditional
normalization in the memory module, which accumulates semantic and dynamic
information from historical BEV embeddings. These BEV features are then
conveyed to the world decoder for future occupancy and flow forecasting,
considering both geometry and spatiotemporal modeling. Additionally, we propose
injecting flexible action conditions, such as velocity, steering angle,
trajectory, and commands, into the world model to enable controllable
generation and facilitate a broader range of downstream applications.
Furthermore, we explore integrating the generative capabilities of the 4D world
model with end-to-end planning, enabling continuous forecasting of future
states and the selection of optimal trajectories using an occupancy-based cost
function. Comprehensive experiments conducted on the nuScenes,
nuScenes-Occupancy, and Lyft-Level5 datasets illustrate that our method can
generate plausible and controllable 4D occupancy, paving the way for
advancements in driving world generation and end-to-end planning. Project page:
https://drive-occworld.github.io/
comment: Accepted by AAAI2025
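The occupancy-based cost function mentioned above can be sketched as accumulating predicted occupancy probability along each candidate trajectory and picking the cheapest one. The grid layout, the out-of-grid penalty, and the plain summation are illustrative assumptions, not Drive-OccWorld's exact planner.

    import numpy as np

    def occupancy_cost(traj_xy, occ_forecast, cell_size=0.5, origin=(-50.0, -50.0)):
        """Cost of one trajectory under a forecast occupancy grid.

        traj_xy: (T, 2) future ego positions in metres.
        occ_forecast: (T, H, W) predicted occupancy probabilities per future step.
        """
        T, H, W = occ_forecast.shape
        cost = 0.0
        for t, (x, y) in enumerate(traj_xy[:T]):
            i = int((y - origin[1]) / cell_size)
            j = int((x - origin[0]) / cell_size)
            if 0 <= i < H and 0 <= j < W:
                cost += float(occ_forecast[t, i, j])
            else:
                cost += 1.0   # leaving the grid is treated as maximally risky
        return cost

    def select_trajectory(candidates, occ_forecast):
        # Pick the candidate trajectory with the lowest accumulated occupancy cost.
        costs = [occupancy_cost(c, occ_forecast) for c in candidates]
        return candidates[int(np.argmin(costs))], costs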
♻ ☆ Deep Plug-and-Play HIO Approach for Phase Retrieval
In the phase retrieval problem, the aim is the recovery of an unknown image
from intensity-only measurements such as Fourier intensity. Although there are
several solution approaches, solving this problem is challenging due to its
nonlinear and ill-posed nature. Recently, learning-based approaches have
emerged as powerful alternatives to the analytical methods for several inverse
problems. In the context of phase retrieval, a novel plug-and-play approach
that exploits learning-based prior and efficient update steps has been
presented at the Computational Optical Sensing and Imaging topical meeting,
with demonstrated state-of-the-art performance. The key idea was to incorporate
learning-based prior to the Gerchberg-Saxton type algorithms through
plug-and-play regularization. In this paper, we present the mathematical
development of the method including the derivation of its analytical update
steps based on half-quadratic splitting and comparatively evaluate its
performance through extensive simulations on a large test dataset. The results
show the effectiveness of the method in terms of image quality,
computational efficiency, and robustness to initialization and noise.
comment: 16 pages, 5 figures
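A toy version of the plug-and-play alternation described above is sketched below: a Gerchberg-Saxton-style data-consistency step that imposes the measured Fourier magnitude, followed by a denoising step that stands in for the learned prior. The initialization, iteration count, and the simple alternation are assumptions; the paper's half-quadratic-splitting updates are more elaborate.

    import numpy as np

    def pnp_phase_retrieval(fourier_mag, denoiser, n_iters=100):
        """Toy plug-and-play loop for phase retrieval from Fourier magnitudes.

        fourier_mag: measured Fourier magnitudes |F(x)| (2D array).
        denoiser: callable image -> image acting as the learned prior.
        """
        x = np.random.rand(*fourier_mag.shape)            # random initial guess
        for _ in range(n_iters):
            # Data-consistency step: keep the current phase, impose measured magnitude.
            X = np.fft.fft2(x)
            phase = np.exp(1j * np.angle(X))
            x = np.real(np.fft.ifft2(fourier_mag * phase))
            # Prior step: the plugged-in denoiser plays the role of the regularizer.
            x = denoiser(x)
        return x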
♻ ☆ Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models
Large Vision-Language Models (LVLMs) have achieved remarkable success in a
wide range of multimodal tasks by integrating pre-trained vision encoders and
large language models. However, current LVLMs primarily rely on visual features
extracted from the final layers of the vision encoder, overlooking the
complementary information available in shallower layers. While recent
approaches have explored the use of multilayer visual features in LVLMs, they
tend to be task-agnostic and fail to examine the dependencies of hierarchical
visual features on specific tasks. To address these gaps, we systematically
investigate the contributions of visual features from different encoder layers
using 18 benchmarks spanning 6 task categories. Our findings reveal that
multilayer features provide complementary strengths with varying task
dependencies, and uniform fusion leads to suboptimal performance. Building on
these insights, we propose the instruction-guided vision aggregator, a module
that dynamically integrates multi-layer visual features based on textual
instructions, without increasing the number of visual tokens. Extensive
evaluations demonstrate the superior performance of our method. Additionally,
an in-depth analysis of the aggregator's behavior highlights the dominance of
mid-to-high-level features in semantic-rich tasks and the critical role of
low-level features in fine-grained perception.
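A minimal sketch of an instruction-guided aggregator in the spirit described above: per-layer weights are predicted from the instruction embedding and used for a weighted sum of layer features, so the token count stays unchanged. The gating network, dimensions, and softmax weighting are assumptions, not the paper's module.

    import torch
    import torch.nn as nn

    class InstructionGuidedAggregator(nn.Module):
        """Toy aggregator: per-encoder-layer weights predicted from the instruction.

        layer_feats: list of L tensors, each (B, N, D) -- visual tokens per layer.
        instr_emb:   (B, D_t) pooled embedding of the textual instruction.
        """
        def __init__(self, num_layers, instr_dim):
            super().__init__()
            self.gate = nn.Sequential(nn.Linear(instr_dim, instr_dim),
                                      nn.GELU(),
                                      nn.Linear(instr_dim, num_layers))

        def forward(self, layer_feats, instr_emb):
            weights = torch.softmax(self.gate(instr_emb), dim=-1)      # (B, L)
            stacked = torch.stack(layer_feats, dim=1)                  # (B, L, N, D)
            fused = (weights[:, :, None, None] * stacked).sum(dim=1)   # (B, N, D)
            return fused      # same number of visual tokens as a single layer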
♻ ☆ Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection
Yuanze Li, Haolin Wang, Shihao Yuan, Ming Liu, Debin Zhao, Yiwen Guo, Chen Xu, Guangming Shi, Wangmeng Zuo
Due to the training configuration, traditional industrial anomaly detection
(IAD) methods have to train a specific model for each deployment scenario,
which is insufficient to meet the requirements of modern design and
manufacturing. On the contrary, large multimodal models (LMMs) have shown
eminent generalization ability on various vision tasks, and their perception
and comprehension capabilities imply the potential of applying LMMs on IAD
tasks. However, we observe that even though the LMMs have abundant knowledge
about industrial anomaly detection in the textual domain, the LMMs are unable
to leverage the knowledge due to the modality gap between textual and visual
domains. To stimulate the relevant knowledge in LMMs and adapt the LMMs towards
anomaly detection tasks, we introduce existing IAD methods as vision experts
and present a novel large multimodal model applying vision experts for
industrial anomaly detection (abbreviated to Myriad). Specifically, we
utilize the anomaly map generated by the vision experts as guidance for LMMs,
such that the vision model is guided to pay more attention to anomalous
regions. Then, the visual features are modulated via an adapter to fit the
anomaly detection tasks, which are fed into the language model together with
the vision expert guidance and human instructions to generate the final
outputs. Extensive experiments on the MVTec-AD, VisA, and PCB Bank
benchmarks demonstrate that our proposed method not only performs favorably
against state-of-the-art methods, but also inherits the flexibility and
instruction-following ability of LMMs in the field of IAD. Source code and
pre-trained models are publicly available at
https://github.com/tzjtatata/Myriad.
comment: 8 pages, 7 figures
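One plausible reading of the expert-guided modulation described above is sketched below: the vision expert's anomaly map is resized to the token grid and used to re-weight the visual tokens before they reach the language model. The gating form and the adapter layout are illustrative assumptions, not Myriad's exact design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AnomalyGuidedAdapter(nn.Module):
        """Toy adapter: bias visual tokens toward regions flagged by a vision expert.

        vis_tokens:  (B, H*W, D) patch tokens from the vision encoder.
        anomaly_map: (B, 1, h, w) expert anomaly map; resized to the token grid.
        """
        def __init__(self, dim, grid_size):
            super().__init__()
            self.grid_size = grid_size
            self.proj = nn.Linear(dim, dim)
            self.gate = nn.Parameter(torch.tensor(1.0))

        def forward(self, vis_tokens, anomaly_map):
            g = self.grid_size
            amap = F.interpolate(anomaly_map, size=(g, g), mode="bilinear",
                                 align_corners=False)          # (B, 1, g, g)
            amap = amap.flatten(2).transpose(1, 2)             # (B, g*g, 1)
            # Emphasize tokens covering anomalous regions before the language model.
            return self.proj(vis_tokens * (1.0 + self.gate * amap))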
♻ ☆ TraceFL: Interpretability-Driven Debugging in Federated Learning via Neuron Provenance ICSE
In Federated Learning, clients train models on local data and send updates to
a central server, which aggregates them into a global model using a fusion
algorithm. This collaborative yet privacy-preserving training comes at a cost.
FL developers face significant challenges in attributing global model
predictions to specific clients. Localizing responsible clients is a crucial
step towards (a) excluding clients primarily responsible for incorrect
predictions and (b) encouraging clients who contributed high-quality models to
continue participating in the future. Existing ML debugging approaches are
inherently inapplicable as they are designed for single-model, centralized
training.
We introduce TraceFL, a fine-grained neuron provenance capturing mechanism
that identifies clients responsible for a global model's prediction by tracking
the flow of information from individual clients to the global model. Since
inference on different inputs activates a different set of neurons of the
global model, TraceFL dynamically quantifies the significance of the global
model's neurons in a given prediction, identifying the most crucial neurons in
the global model. It then maps them to the corresponding neurons in every
participating client to determine each client's contribution, ultimately
localizing the responsible client. We evaluate TraceFL on six datasets,
including two real-world medical imaging datasets, and on four neural networks,
including advanced models such as GPT. TraceFL achieves 99% accuracy in
localizing the responsible client in FL tasks spanning both image and text
classification. At a time when state-of-the-art ML debugging approaches
are mostly domain-specific (e.g., image classification only), TraceFL is the
first technique to enable highly accurate automated reasoning across a wide
range of FL applications.
comment: Accepted at 2025 IEEE/ACM 47th International Conference on Software
Engineering (ICSE)
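A toy sketch of the neuron-provenance idea above is given below: neuron significance for one prediction is estimated from activations, and each client is scored by how closely its corresponding neurons align with the global model's important neurons. The scoring rule is an illustrative assumption, not TraceFL's exact mechanism.

    import numpy as np

    def neuron_significance(activations):
        # Toy significance: normalized absolute activation of each neuron
        # for the prediction under inspection.
        a = np.abs(activations)
        return a / (a.sum() + 1e-12)

    def localize_responsible_client(global_neurons, client_neurons, activations):
        """Toy client localization in the spirit of neuron provenance.

        global_neurons: (N, D) weight vectors of N neurons in the global model.
        client_neurons: dict client_id -> (N, D) corresponding neuron weights.
        activations:    (N,) activations of those neurons for one prediction.
        """
        sig = neuron_significance(activations)
        scores = {}
        for cid, w in client_neurons.items():
            # Alignment of each client's neuron with the global neuron,
            # weighted by how much that neuron mattered for this prediction.
            num = (global_neurons * w).sum(axis=1)
            den = (np.linalg.norm(global_neurons, axis=1)
                   * np.linalg.norm(w, axis=1) + 1e-12)
            scores[cid] = float((sig * (num / den)).sum())
        return max(scores, key=scores.get), scores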
♻ ☆ Epicardium Prompt-guided Real-time Cardiac Ultrasound Frame-to-volume Registration MICCAI 2024
Long Lei, Jun Zhou, Jialun Pei, Baoliang Zhao, Yueming Jin, Yuen-Chun Jeremy Teoh, Jing Qin, Pheng-Ann Heng
A comprehensive guidance view for cardiac interventional surgery can be
provided by the real-time fusion of the intraoperative 2D images and
preoperative 3D volume based on the ultrasound frame-to-volume registration.
However, cardiac ultrasound images are characterized by a low signal-to-noise
ratio and small differences between adjacent frames, coupled with significant
dimension variations between 2D frames and 3D volumes to be registered,
making real-time and accurate cardiac ultrasound frame-to-volume
registration a very challenging task. This paper introduces a lightweight
end-to-end Cardiac Ultrasound frame-to-volume Registration network, termed
CU-Reg. Specifically, the proposed model leverages epicardium prompt-guided
anatomical clues to reinforce the interaction of 2D sparse and 3D dense
features, followed by a voxel-wise local-global aggregation of enhanced
features, thereby boosting the cross-dimensional matching effectiveness of
low-quality ultrasound modalities. We further embed an inter-frame
discriminative regularization term within the hybrid supervised learning to
increase the distinction between adjacent slices in the same ultrasound volume
to ensure registration stability. Experimental results on the reprocessed CAMUS
dataset demonstrate that our CU-Reg surpasses existing methods in terms of
registration accuracy and efficiency, meeting the guidance requirements of
clinical cardiac interventional surgery.
comment: This paper has been accepted by MICCAI 2024
♻ ☆ NeuManifold: Neural Watertight Manifold Reconstruction with Efficient and High-Quality Rendering Support
We present a method for generating high-quality watertight manifold meshes
from multi-view input images. Existing volumetric rendering methods are robust
in optimization but tend to generate noisy meshes with poor topology.
Differentiable rasterization-based methods can generate high-quality meshes but
are sensitive to initialization. Our method combines the benefits of both
worlds; we take the geometry initialization obtained from neural volumetric
fields, and further optimize the geometry as well as a compact neural texture
representation with differentiable rasterizers. Through extensive experiments,
we demonstrate that our method can generate accurate mesh reconstructions with
faithful appearance that are comparable to previous volume rendering methods
while being an order of magnitude faster in rendering. We also show that our
generated mesh and neural texture reconstruction is compatible with existing
graphics pipelines and enables downstream 3D applications such as simulation.
Project page: https://sarahweiii.github.io/neumanifold/
comment: Project page: https://sarahweiii.github.io/neumanifold/
♻ ☆ FireANTs: Adaptive Riemannian Optimization for Multi-Scale Diffeomorphic Matching
The paper proposes FireANTs, the first multi-scale Adaptive Riemannian
Optimization algorithm for dense diffeomorphic image matching. One of the most
critical and understudied aspects of diffeomorphic image matching algorithms
is their highly ill-conditioned nature. We quantitatively capture the extent of
ill-conditioning in a typical MRI matching task, motivating the need for an
adaptive optimization algorithm for diffeomorphic matching. To this end,
FireANTs generalizes the concept of momentum and adaptive estimates of the
Hessian to mitigate this ill-conditioning in the non-Euclidean space of
diffeomorphisms. Unlike common non-Euclidean manifolds, we also formalize
considerations for multi-scale optimization of diffeomorphisms. Our rigorous
mathematical results and operational contributions lead to a state-of-the-art
dense matching algorithm that can be applied to generic image data with
remarkable accuracy and robustness. We demonstrate consistent improvements in
image matching performance across a spectrum of community-standard medical and
biological correspondence matching challenges spanning a wide variety of image
modalities, anatomies, resolutions, acquisition protocols, and preprocessing
pipelines. This improvement is accompanied by a 300x to 3200x speedup
over existing state-of-the-art algorithms. For the first time, we perform
diffeomorphic matching of sub-micron mouse cortex volumes at native resolution.
Our fast implementation also enables hyperparameter studies that were
intractable with existing correspondence matching algorithms.
♻ ☆ Learnable Scaled Gradient Descent for Guaranteed Robust Tensor PCA
Robust tensor principal component analysis (RTPCA) aims to separate the
low-rank and sparse components from multi-dimensional data, making it an
essential technique in the signal processing and computer vision fields.
Recently emerging tensor singular value decomposition (t-SVD) has gained
considerable attention for its ability to better capture the low-rank structure
of tensors compared to traditional matrix SVD. However, existing methods often
rely on the computationally expensive tensor nuclear norm (TNN), which limits
their scalability for real-world tensors. To address this issue, we explore an
efficient scaled gradient descent (SGD) approach within the t-SVD framework for
the first time, and propose the RTPCA-SGD method. Theoretically, we rigorously
establish the recovery guarantees of RTPCA-SGD under mild assumptions,
demonstrating that with appropriate parameter selection, it achieves linear
convergence to the true low-rank tensor at a constant rate, independent of the
condition number. To enhance its practical applicability, we further propose a
learnable self-supervised deep unfolding model, which enables effective
parameter learning. Numerical experiments on both synthetic and real-world
datasets demonstrate the superior performance of the proposed methods while
maintaining competitive computational efficiency, in particular requiring less
time than RTPCA-TNN.
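The t-SVD algebra is involved, but the core scaled-gradient-descent idea can be sketched on the matrix analogue of robust PCA: factor the low-rank part as A @ B.T, peel off a sparse component by thresholding, and precondition the factor gradients so the step is insensitive to conditioning. This is a matrix sketch with assumed step sizes and a soft-thresholding rule, not the paper's tensor algorithm.

    import numpy as np

    def soft_threshold(X, tau):
        return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

    def scaled_gd_rpca(Y, rank, eta=0.5, tau=0.1, n_iters=200):
        """Matrix sketch of scaled gradient descent for robust PCA: Y ~ A @ B.T + S."""
        m, n = Y.shape
        rng = np.random.default_rng(0)
        A = rng.standard_normal((m, rank)) * 0.1
        B = rng.standard_normal((n, rank)) * 0.1
        for _ in range(n_iters):
            S = soft_threshold(Y - A @ B.T, tau)        # sparse component
            R = A @ B.T + S - Y                         # residual of the fit
            # Preconditioning by (B.T B)^-1 and (A.T A)^-1 keeps the step size
            # insensitive to the condition number of the low-rank factors.
            A_new = A - eta * R @ B @ np.linalg.inv(B.T @ B + 1e-8 * np.eye(rank))
            B_new = B - eta * R.T @ A @ np.linalg.inv(A.T @ A + 1e-8 * np.eye(rank))
            A, B = A_new, B_new
        S = soft_threshold(Y - A @ B.T, tau)
        return A @ B.T, S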
♻ ☆ Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting EMNLP 2024
In recent years, the rapid increase in online video content has underscored
the limitations of static Video Question Answering (VideoQA) models trained on
fixed datasets, as they struggle to adapt to new questions or tasks posed by
newly available content. In this paper, we explore the novel challenge of
VideoQA within a continual learning framework, and empirically identify a
critical issue: fine-tuning a large language model (LLM) for a sequence of
tasks often results in catastrophic forgetting. To address this, we propose
Collaborative Prompting (ColPro), which integrates specific question constraint
prompting, knowledge acquisition prompting, and visual temporal awareness
prompting. These prompts aim to capture textual question context, visual
content, and video temporal dynamics in VideoQA, a perspective underexplored in
prior research. Experimental results on the NExT-QA and DramaQA datasets show
that ColPro achieves superior performance compared to existing approaches,
achieving 55.14% accuracy on NExT-QA and 71.24% accuracy on DramaQA,
highlighting its practical relevance and effectiveness.
comment: Accepted by main EMNLP 2024
♻ ☆ IOR: Inversed Objects Replay for Incremental Object Detection
Existing Incremental Object Detection (IOD) methods partially alleviate
catastrophic forgetting when incrementally detecting new objects in real-world
scenarios. However, many of these methods rely on the assumption that unlabeled
old-class objects may co-occur with labeled new-class objects in the
incremental data. When unlabeled old-class objects are absent, the performance
of existing methods tends to degrade. The absence can be mitigated by
generating old-class samples, but it incurs high costs. This paper argues that
previous generation-based IOD suffers from redundancy, both in the use of
generative models, which require additional training and storage, and in the
overproduction of generated samples, many of which do not contribute
significantly to performance improvements. To eliminate the redundancy, we
propose Inversed Objects Replay (IOR). Specifically, we generate old-class
samples by inverting the original detectors, thus eliminating the necessity of
training and storing additional generative models. We propose augmented replay
to reuse the objects in generated samples, reducing redundant generations.
Moreover, we propose high-value knowledge distillation focusing on the
positions of old-class objects overwhelmed by the background, which transfers
the knowledge to the incremental detector. Extensive experiments conducted on
MS COCO 2017 demonstrate that our method can efficiently improve detection
performance in IOD scenarios with the absence of old-class objects.
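The "inverting the original detector" step above is a model-inversion idea; the toy sketch below synthesizes an old-class sample by optimizing an input so a frozen classifier-style head scores it highly for that class. The total-variation prior, hyperparameters, and the use of a plain classifier in place of a detector are assumptions.

    import torch
    import torch.nn.functional as F

    def invert_old_class(frozen_model, class_idx, image_shape=(3, 224, 224),
                         steps=200, lr=0.1, tv_weight=1e-4):
        """Toy model inversion: synthesize a sample of an old class from a frozen model.

        frozen_model: network returning classification logits for an image batch.
        """
        x = torch.randn(1, *image_shape, requires_grad=True)
        opt = torch.optim.Adam([x], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            logits = frozen_model(x)
            cls_loss = F.cross_entropy(logits, torch.tensor([class_idx]))
            # Simple smoothness prior so the optimized image is not pure noise.
            tv = (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
                 (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
            (cls_loss + tv_weight * tv).backward()
            opt.step()
        return x.detach()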
♻ ☆ Challenge Summary U-MedSAM: Uncertainty-aware MedSAM for Medical Image Segmentation
Medical Image Foundation Models have proven to be powerful tools for mask
prediction across various datasets. However, accurately assessing the
uncertainty of their predictions remains a significant challenge. To address
this, we propose a new model, U-MedSAM, which integrates the MedSAM model with
an uncertainty-aware loss function and the Sharpness-Aware Minimization
(SharpMin) optimizer. The uncertainty-aware loss function automatically
combines region-based, distribution-based, and pixel-based loss designs to
enhance segmentation accuracy and robustness. SharpMin improves generalization
by finding flat minima in the loss landscape, thereby reducing overfitting. Our
method was evaluated in the CVPR24 MedSAM on Laptop challenge, where U-MedSAM
demonstrated promising performance.
comment: arXiv admin note: text overlap with arXiv:2405.17496
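Sharpness-Aware Minimization, used as the optimizer above, proceeds in two half-steps: perturb the weights along the current gradient direction, compute the gradient there, then update from the original weights. The generic sketch below follows the standard SAM recipe; the rho value and the closure interface are conventional choices, not taken from U-MedSAM.

    import torch

    def sam_step(model, loss_fn, optimizer, rho=0.05):
        """One Sharpness-Aware Minimization step (generic sketch).

        loss_fn: closure returning the current loss (re-runs the forward pass).
        """
        optimizer.zero_grad()
        loss_fn().backward()
        grads = [p.grad.detach().clone() if p.grad is not None else None
                 for p in model.parameters()]
        grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads if g is not None))

        # Ascend to the worst-case nearby weights w + rho * g / ||g||.
        with torch.no_grad():
            for p, g in zip(model.parameters(), grads):
                if g is not None:
                    p.add_(g * (rho / (grad_norm + 1e-12)))

        # Gradient at the perturbed weights drives the actual update.
        optimizer.zero_grad()
        loss_fn().backward()
        with torch.no_grad():
            for p, g in zip(model.parameters(), grads):
                if g is not None:
                    p.sub_(g * (rho / (grad_norm + 1e-12)))   # restore original weights
        optimizer.step()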
♻ ☆ MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning NeurIPS 2024
Video causal reasoning aims to achieve a high-level understanding of videos
from a causal perspective. However, it exhibits limitations in its scope,
primarily executed in a question-answering paradigm and focusing on brief video
segments containing isolated events and basic causal relations, lacking
comprehensive and structured causality analysis for videos with multiple
interconnected events. To fill this gap, we introduce a new task and dataset,
Multi-Event Causal Discovery (MECD). It aims to uncover the causal relations
between events distributed chronologically across long videos. Given visual
segments and textual descriptions of events, MECD identifies the causal
associations between these events to derive a comprehensive and structured
event-level video causal graph explaining why and how the result event
occurred. To address the challenges of MECD, we devise a novel framework
inspired by the Granger Causality method, incorporating an efficient mask-based
event prediction model to perform an Event Granger Test. It estimates causality
by comparing the predicted result event when premise events are masked versus
unmasked. Furthermore, we integrate causal inference techniques such as
front-door adjustment and counterfactual inference to mitigate challenges in
MECD like causality confounding and illusory causality. Additionally, context
chain reasoning is introduced to conduct more robust and generalized reasoning.
Experiments validate the effectiveness of our framework in reasoning complete
causal relations, outperforming GPT-4o and VideoChat2 by 5.77% and 2.70%,
respectively. Further experiments demonstrate that causal relation graphs can
also contribute to downstream video understanding tasks such as video question
answering and video event prediction.
comment: IEEE TPAMI submission; continuation of arXiv:2409.17647 (NeurIPS
2024)
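The Event Granger Test described above compares how well the result event is predicted with a premise event masked versus unmasked; a premise is treated as causal when masking it noticeably degrades the prediction. In the sketch below, the prediction callable and the simple score-difference criterion are assumptions.

    def event_granger_scores(predict_result, premise_events, result_event):
        """Toy Event Granger Test over a list of premise events.

        predict_result(premise_events, result_event) -> score in [0, 1]: how well
        the model predicts the result event given the (possibly masked) premises.
        """
        full_score = predict_result(premise_events, result_event)
        scores = {}
        for i, ev in enumerate(premise_events):
            masked = premise_events[:i] + ["[MASK]"] + premise_events[i + 1:]
            masked_score = predict_result(masked, result_event)
            # Larger drop when the premise is masked => stronger causal evidence.
            scores[ev] = full_score - masked_score
        return scores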
♻ ☆ LADDER: Language Driven Slice Discovery and Error Rectification
Error slice discovery is crucial to diagnose and mitigate model errors.
Current clustering or discrete attribute-based slice discovery methods face key
limitations: 1) clustering results in incoherent slices, while assigning
discrete attributes to slices leads to incomplete coverage of error patterns
due to missing or insufficient attributes; 2) these methods lack complex
reasoning, preventing them from fully explaining model biases; 3) they fail to
integrate domain knowledge, limiting their usage in specialized fields,
e.g., radiology. We propose LADDER (Language-Driven Discovery and Error
Rectification) to
address the limitations by: (1) leveraging the flexibility of natural language
to address incompleteness, and (2) employing the LLM's latent domain knowledge
and advanced reasoning to analyze sentences and derive testable hypotheses
directly, identifying biased attributes, and forming coherent error slices without
clustering. Existing mitigation methods typically address only the
worst-performing group, often amplifying errors in other subgroups. In
contrast, LADDER generates pseudo attributes from the discovered hypotheses to
mitigate errors across all biases without explicit attribute annotations or
prior knowledge of bias. Rigorous evaluations on 6 datasets spanning natural
and medical images -- comparing 200+ classifiers with diverse architectures,
pretraining strategies, and LLMs -- show that LADDER consistently outperforms
existing baselines in discovering and mitigating biases.
♻ ☆ MetaNeRV: Meta Neural Representations for Videos with Spatial-Temporal Guidance AAAI2025
Neural Representations for Videos (NeRV) has emerged as a promising implicit
neural representation (INR) approach for video analysis, which represents
videos as neural networks with frame indexes as inputs. However, NeRV-based
methods are time-consuming when adapting to a large number of diverse videos,
as each video requires a separate NeRV model to be trained from scratch. In
addition, NeRV-based methods must spatially generate a high-dimensional
signal (i.e., an entire image) from a low-dimensional timestamp input, while
temporally a video consists of tens of frames with only minor changes between
adjacent frames. To improve the efficiency of video
representation, we propose Meta Neural Representations for Videos, named
MetaNeRV, a novel framework for fast NeRV representation for unseen videos.
MetaNeRV leverages a meta-learning framework to learn an optimal parameter
initialization, which serves as a good starting point for adapting to new
videos. To address the unique spatial and temporal characteristics of video
modality, we further introduce spatial-temporal guidance to improve the
representation capabilities of MetaNeRV. Specifically, the spatial guidance
with a multi-resolution loss aims to capture the information from different
resolution stages, and the temporal guidance with an effective progressive
learning strategy could gradually refine the number of fitted frames during the
meta-learning process. Extensive experiments conducted on multiple datasets
demonstrate the superiority of MetaNeRV for video representations and video
compression.
comment: Accepted by AAAI2025
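The meta-learned initialization described above can be sketched with a Reptile-style outer loop: adapt a copy of the shared NeRV weights to each video for a few steps, then move the shared weights toward the adapted ones, fitting more frames as meta-training advances. Reptile, the step counts, and the frame schedule are illustrative assumptions, not MetaNeRV's exact procedure.

    import copy
    import torch

    def reptile_meta_init(model, videos, inner_steps=5, inner_lr=1e-3, meta_lr=0.1,
                          loss_fn=torch.nn.functional.mse_loss):
        """Reptile-style sketch of learning a NeRV initialization over many videos.

        videos: iterable of (timestamps, frames) tensors for individual videos;
        model(t) is assumed to map timestamps to frames.
        """
        for step, (t, frames) in enumerate(videos):
            fast = copy.deepcopy(model)
            opt = torch.optim.Adam(fast.parameters(), lr=inner_lr)
            # Progressive schedule: fit more frames as meta-training advances.
            n_fit = min(len(frames), 4 * (step + 1))
            for _ in range(inner_steps):
                opt.zero_grad()
                loss_fn(fast(t[:n_fit]), frames[:n_fit]).backward()
                opt.step()
            # Move the shared initialization toward the video-adapted weights.
            with torch.no_grad():
                for p, q in zip(model.parameters(), fast.parameters()):
                    p.add_(meta_lr * (q - p))
        return model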
♻ ☆ Keep It Accurate and Robust: An Enhanced Nuclei Analysis Framework
Accurate segmentation and classification of nuclei in histology images is
critical but challenging due to nuclei heterogeneity, staining variations, and
tissue complexity. Existing methods often struggle with limited dataset
variability, with patches extracted from similar whole slide images (WSI),
making models prone to falling into local optima. Here we propose a new
framework to address this limitation and enable robust nuclear analysis. Our
method leverages dual-level ensemble modeling to overcome issues stemming from
limited dataset variation. Intra-ensembling applies diverse transformations to
individual samples, while inter-ensembling combines networks of different
scales. We also introduce enhancements to the HoVer-Net architecture, including
updated encoders, nested dense decoding, and a model regularization strategy. We
achieve state-of-the-art results on public benchmarks, including 1st place for
nuclear composition prediction and 3rd place for segmentation/classification in
the 2022 Colon Nuclei Identification and Counting (CoNIC) Challenge. This
success validates our approach for accurate histological nuclei analysis.
Extensive experiments and ablation studies provide insights into optimal
network design choices and training techniques. In conclusion, this work
proposes an improved framework advancing the state-of-the-art in nuclei
analysis. We release our code and models
(https://github.com/WinnieLaugh/CONIC_Pathology_AI) to serve as a toolkit for
the community.
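The dual-level ensembling described above can be sketched as test-time augmentation per sample (intra-ensembling) combined with averaging over several trained networks (inter-ensembling). The flip-based transform set and plain averaging below are illustrative assumptions, not the framework's exact recipe.

    import torch

    def dual_level_ensemble(models, image):
        """Toy dual-level ensemble for dense prediction (e.g. nuclei segmentation).

        models: list of trained networks mapping (B, C, H, W) images to logits.
        image:  (B, C, H, W) input tensor.
        """
        # (transform, inverse-transform) pairs applied to the image / prediction.
        tta = [
            (lambda x: x,                   lambda y: y),
            (lambda x: torch.flip(x, [3]),  lambda y: torch.flip(y, [3])),  # horizontal
            (lambda x: torch.flip(x, [2]),  lambda y: torch.flip(y, [2])),  # vertical
        ]
        preds = []
        with torch.no_grad():
            for model in models:                      # inter-ensembling over networks
                for fwd, inv in tta:                  # intra-ensembling over transforms
                    preds.append(inv(torch.sigmoid(model(fwd(image)))))
        return torch.stack(preds).mean(dim=0)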